5420 Anomaly Detection | Assignment 4 - Joyce Ng (jn2901)¶

The purpose of conducting anomaly analysis on healthcare/inpatient data is to identify and understand unusual patterns or outliers that could indicate potential issues or opportunities for improvement in healthcare delivery, cost management, and patient outcomes. The objective of this project is to perform exhaustive EDA on the inpatient data and provide business insights for each feature, then use KNN and PCA from PyOD to identify anomalies.¶

Description of dataset:

DRG Definition:

Diagnosis Related Group. Classification system that groups similar clinical conditions (diagnoses) and the procedures furnished by the hospital during the stay.

Total Discharges:

The number of discharges billed by all providers for inpatient hospital services.

Average Covered Charges:

= Total Covered Charge Amount / Total Discharges. The average charge for all providers' services covered by Medicare for discharges in the DRG. These vary from hospital to hospital because of differences in hospital charge structures.

Average Total Payment:

= Total Payments / Total Discharges. The average of total payments: what Medicare actually pays to the provider, plus the co-payment and deductible amounts the beneficiary is responsible for, plus payments by third parties for coordination of benefits.

Average Medicare Payment:

= Medicare Payment Amount / Total Discharges. The average amount that Medicare pays to the provider. Medicare payment amounts include the MS-DRG amount, teaching, disproportionate share, capital, and outlier payments for all cases. Medicare payments DO NOT include beneficiary co-payments and deductible amounts nor any additional payments from third parties for coordination of benefits.
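These definitions imply a relationship used later in Section 3.3: the average out-of-pocket (plus third-party) portion is approximately Average Total Payments minus Average Medicare Payments. A minimal sketch, using illustrative figures and the renamed column names adopted later in this notebook:

```python
import pandas as pd

# Toy rows; the figures are illustrative, not taken from the dataset
df = pd.DataFrame({
    "Average_Total_Payments": [5777.24, 5787.57],
    "Average_Medicare_Payments": [4763.73, 4976.71],
})

# Out-of-pocket (plus third-party) portion = total payments minus Medicare's share
df["Average_OOP_Payments"] = (
    df["Average_Total_Payments"] - df["Average_Medicare_Payments"]
)
print(df["Average_OOP_Payments"].round(2).tolist())  # [1013.51, 810.86]
```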

Table of Contents¶

  • Section 1: Data Preparation
    • 1.1 Load Libraries and Dataset
    • 1.2 Check Missing Values & Change Columns Names and Data Type
  • Section 2: EDA
    • 2.1 Distribution of Variables
      • 2.1.1 Distributions of Avg. Payments
      • 2.1.2 Distributions of Charges and Payments
      • 2.1.3 DRG Frequency
      • 2.1.4 Distributions of Charges and Payments by State
  • Section 3: Feature Engineering
    • 3.1 Average Medicare Payments
      • 3.1.1 By DRG
      • 3.1.2 Per DRG by State
      • 3.1.3 Per DRG by State & City
      • 3.1.4 Per DRG by Zip Code
    • 3.2 Average Total Payments
      • 3.2.1 Per DRG by State
      • 3.2.2 Per DRG by State & City
      • 3.2.3 Per DRG by Zip Code
    • 3.3 Average Out-Of-Pocket Payments
      • 3.3.1 Per DRG by State
      • 3.3.2 Per DRG by State & City
      • 3.3.3 Per DRG by Zip Code
  • Section 4: K-Nearest Neighbor (KNN)
    • 4.1 KNN
      • 4.1.1 Outliers List
      • 4.1.2 Top 10 Outliers by DRG
      • 4.1.3 Outliers by State
    • 4.2 Combination by Average
      • 4.2.1 Outliers List
      • 4.2.2 Top 10 Outliers by DRG
      • 4.2.3 Outliers by State
  • Section 5: Principal Component Analysis (PCA)
    • 5.1 PCA
      • 5.1.1 Outliers List
      • 5.1.2 Top 10 Outliers by DRG
      • 5.1.3 Outliers by State
  • Section 6: Model Comparison

     

Section 1: Data Preparation ¶

1.1 Load Libraries and Dataset ¶

In [1]:
# Load libraries
import pandas as pd
import numpy as np
from functools import reduce
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.metrics import roc_curve,roc_auc_score, confusion_matrix, f1_score, accuracy_score, make_scorer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from pyod.models.combination import aom, moa, average, maximization
from pyod.utils.utility import standardizer
from sklearn import metrics
from pyod.models.knn import KNN
from pyod.models.pca import PCA

# Visualization
import plotly.express as px
import plotly.graph_objs as go
import plotly.subplots as sp
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import plotly.io as pio
from IPython.display import display
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
import matplotlib.pyplot as plt
import seaborn as sns

import warnings 
warnings.filterwarnings("ignore")
In [2]:
data = pd.read_csv('/Users/Joyce630/Desktop/Columbia/5420 Anomaly Detection/Assignments/4 - Healthcare Fraud/inpatientCharges.csv')
data.head()
Out[2]:
DRG Definition Provider Id Provider Name Provider Street Address Provider City Provider State Provider Zip Code Hospital Referral Region Description Total Discharges Average Covered Charges Average Total Payments Average Medicare Payments
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 91 $32963.07 $5777.24 $4763.73
1 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10005 MARSHALL MEDICAL CENTER SOUTH 2505 U S HIGHWAY 431 NORTH BOAZ AL 35957 AL - Birmingham 14 $15131.85 $5787.57 $4976.71
2 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10006 ELIZA COFFEE MEMORIAL HOSPITAL 205 MARENGO STREET FLORENCE AL 35631 AL - Birmingham 24 $37560.37 $5434.95 $4453.79
3 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10011 ST VINCENT'S EAST 50 MEDICAL PARK EAST DRIVE BIRMINGHAM AL 35235 AL - Birmingham 25 $13998.28 $5417.56 $4129.16
4 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10016 SHELBY BAPTIST MEDICAL CENTER 1000 FIRST STREET NORTH ALABASTER AL 35007 AL - Birmingham 18 $31633.27 $5658.33 $4851.44
In [3]:
print(data.shape, "\n") # Check the shape of df
print(data.columns, "\n") # Check column names
print(data.info(), "\n") # Check info of the df
(163065, 12) 

Index(['DRG Definition', 'Provider Id', 'Provider Name',
       'Provider Street Address', 'Provider City', 'Provider State',
       'Provider Zip Code', 'Hospital Referral Region Description',
       ' Total Discharges ', ' Average Covered Charges ',
       ' Average Total Payments ', 'Average Medicare Payments'],
      dtype='object') 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 163065 entries, 0 to 163064
Data columns (total 12 columns):
 #   Column                                Non-Null Count   Dtype 
---  ------                                --------------   ----- 
 0   DRG Definition                        163065 non-null  object
 1   Provider Id                           163065 non-null  int64 
 2   Provider Name                         163065 non-null  object
 3   Provider Street Address               163065 non-null  object
 4   Provider City                         163065 non-null  object
 5   Provider State                        163065 non-null  object
 6   Provider Zip Code                     163065 non-null  int64 
 7   Hospital Referral Region Description  163065 non-null  object
 8    Total Discharges                     163065 non-null  int64 
 9    Average Covered Charges              163065 non-null  object
 10   Average Total Payments               163065 non-null  object
 11  Average Medicare Payments             163065 non-null  object
dtypes: int64(3), object(9)
memory usage: 14.9+ MB
None 

1.2: Check Missing Values & Change Columns Names and Data Type ¶

In [4]:
# Check for missing values
missing_values = data.isnull().sum()
missing_values
Out[4]:
DRG Definition                          0
Provider Id                             0
Provider Name                           0
Provider Street Address                 0
Provider City                           0
Provider State                          0
Provider Zip Code                       0
Hospital Referral Region Description    0
 Total Discharges                       0
 Average Covered Charges                0
 Average Total Payments                 0
Average Medicare Payments               0
dtype: int64
In [5]:
# Remove leading/trailing whitespace from column names
data = data.rename(columns=lambda x: x.strip())
In [6]:
# Change data types
data['DRG Definition'] = data['DRG Definition'].astype('category')
data['Provider State'] = data['Provider State'].astype('category')
data['Provider City'] = data['Provider City'].astype('category')
data['Provider Id'] = data['Provider Id'].astype('category')

# Remove the $ sign from the payment columns and convert to float
data['Average Covered Charges'] = data['Average Covered Charges'].str.strip("$").astype(float)
data['Average Total Payments'] = data['Average Total Payments'].str.strip("$").astype(float)
data['Average Medicare Payments'] = data['Average Medicare Payments'].str.strip("$").astype(float)
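Note that str.strip("$") only removes leading/trailing dollar signs. If the export ever contains thousands separators or stray spaces, a slightly more defensive variant (a sketch, not what was run above) strips any non-numeric characters before converting:

```python
import pandas as pd

# Toy currency strings with a thousands separator and stray whitespace
s = pd.Series([" $32963.07", "$1,513.85 ", "$37560.37"])

# Remove everything except digits, minus sign, and the decimal point,
# then let to_numeric coerce anything unparseable to NaN
cleaned = pd.to_numeric(s.str.replace(r"[^0-9.\-]", "", regex=True),
                        errors="coerce")
print(cleaned.tolist())  # [32963.07, 1513.85, 37560.37]
```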
In [7]:
# Rename columns (underscores instead of whitespace)
data.columns = ['DRG','Provider_Id', 'Provider_Name','Provider_StreetAddress','Provider_City',
               'Provider_State','Provider_Zipcode','Hospital_referral_region_desp',
                'Total_Discharges','Average_Covered_Charges','Average_Total_Payments',
                'Average_Medicare_Payments']
In [8]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 163065 entries, 0 to 163064
Data columns (total 12 columns):
 #   Column                         Non-Null Count   Dtype   
---  ------                         --------------   -----   
 0   DRG                            163065 non-null  category
 1   Provider_Id                    163065 non-null  category
 2   Provider_Name                  163065 non-null  object  
 3   Provider_StreetAddress         163065 non-null  object  
 4   Provider_City                  163065 non-null  category
 5   Provider_State                 163065 non-null  category
 6   Provider_Zipcode               163065 non-null  int64   
 7   Hospital_referral_region_desp  163065 non-null  object  
 8   Total_Discharges               163065 non-null  int64   
 9   Average_Covered_Charges        163065 non-null  float64 
 10  Average_Total_Payments         163065 non-null  float64 
 11  Average_Medicare_Payments      163065 non-null  float64 
dtypes: category(4), float64(3), int64(2), object(3)
memory usage: 11.1+ MB
In [9]:
data.head()
Out[9]:
DRG Provider_Id Provider_Name Provider_StreetAddress Provider_City Provider_State Provider_Zipcode Hospital_referral_region_desp Total_Discharges Average_Covered_Charges Average_Total_Payments Average_Medicare_Payments
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 91 32963.07 5777.24 4763.73
1 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10005 MARSHALL MEDICAL CENTER SOUTH 2505 U S HIGHWAY 431 NORTH BOAZ AL 35957 AL - Birmingham 14 15131.85 5787.57 4976.71
2 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10006 ELIZA COFFEE MEMORIAL HOSPITAL 205 MARENGO STREET FLORENCE AL 35631 AL - Birmingham 24 37560.37 5434.95 4453.79
3 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10011 ST VINCENT'S EAST 50 MEDICAL PARK EAST DRIVE BIRMINGHAM AL 35235 AL - Birmingham 25 13998.28 5417.56 4129.16
4 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10016 SHELBY BAPTIST MEDICAL CENTER 1000 FIRST STREET NORTH ALABASTER AL 35007 AL - Birmingham 18 31633.27 5658.33 4851.44

Section 2: EDA ¶

2.1: Distribution of Variables ¶

In [10]:
unique_id =  len(data.groupby('Provider_Id').count())
unique_providers = len(data.groupby('Provider_Name').count())
unique_states = len(data.groupby('Provider_State').count())
unique_cities = len(data.groupby('Provider_City').count())
unique_drg = len(data.groupby('DRG').count())

print(f'Unique ID = {unique_id}\n')
print(f'Unique Providers = {unique_providers}\n')
print(f'Unique States = {unique_states}\n')
print(f'Unique Cities = {unique_cities}\n')
print(f'Unique DRG = {unique_drg}\n')
Unique ID = 3337

Unique Providers = 3201

Unique States = 51

Unique Cities = 1977

Unique DRG = 100
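The same counts can be obtained more concisely with nunique(), which counts distinct values per column without building intermediate grouped frames. A sketch on a toy frame (the real call would be data[cols].nunique()):

```python
import pandas as pd

# Toy frame standing in for `data`
toy = pd.DataFrame({
    "Provider_Id": [10001, 10005, 10001],
    "Provider_State": ["AL", "AL", "NY"],
})

# nunique() counts distinct values per column in one pass
print(toy.nunique().to_dict())  # {'Provider_Id': 2, 'Provider_State': 2}
```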

In [11]:
data.describe()
Out[11]:
Provider_Zipcode Total_Discharges Average_Covered_Charges Average_Total_Payments Average_Medicare_Payments
count 163065.000000 163065.000000 163065.000000 163065.000000 163065.000000
mean 47938.121908 42.776304 36133.954224 9707.473804 8494.490964
std 27854.323080 51.104042 35065.365931 7664.642598 7309.467261
min 1040.000000 11.000000 2459.400000 2673.000000 1148.900000
25% 27261.000000 17.000000 15947.160000 5234.500000 4192.350000
50% 44309.000000 27.000000 25245.820000 7214.100000 6158.460000
75% 72901.000000 49.000000 43232.590000 11286.400000 10056.880000
max 99835.000000 3383.000000 929118.900000 156158.180000 154620.810000

2.1.1 Distributions of Avg. Payments ¶

In [12]:
# Create a figure
fig = go.Figure()

# Add histogram for Average Total Payments
fig.add_trace(go.Histogram(x=data['Average_Total_Payments'], name='Average_Total_Payments', opacity=1))
fig.add_trace(go.Histogram(x=data['Average_Medicare_Payments'], name='Average_Medicare_Payments', opacity=0.85))

# Update layout
fig.update_layout(
    title_text='Distribution of Average Total & Medicare Payments',
    xaxis_title_text='Payments ($)',
    yaxis_title_text='Frequency',
    height=800, width=950,
    template = 'seaborn',
    plot_bgcolor = '#F6F5F5'
)

# Show plot
fig.show()

Based on the graph, it is notable that in some cases the average Medicare payment exceeds the average total payment, which is unusual and potentially concerning. Further investigation is therefore needed to identify and understand these discrepancies.
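One way to quantify this (a sketch on a toy frame using the renamed column names; the real check would run on `data`) is to count rows where the Medicare payment exceeds the total payment, which by definition should not happen:

```python
import pandas as pd

# Toy frame standing in for `data`; the second row is deliberately inconsistent
data = pd.DataFrame({
    "Average_Total_Payments": [5777.24, 4000.00],
    "Average_Medicare_Payments": [4763.73, 4500.00],
})

# Medicare's share should never exceed the total payment
suspect = data[data["Average_Medicare_Payments"] > data["Average_Total_Payments"]]
print(len(suspect))  # 1
```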

2.1.2 Distributions of Charges and Payments ¶

In [13]:
sns.pairplot(data[['Provider_State','Average_Total_Payments','Total_Discharges','Average_Medicare_Payments','Average_Covered_Charges']], 
             hue= 'Provider_State')
Out[13]:
<seaborn.axisgrid.PairGrid at 0x1685d6dd0>
No description has been provided for this image

The graph above shows the distributions of the different charges and payments for each state. Some outliers are already visible and warrant further investigation.

2.1.3 DRG Frequency ¶

Identify the most common DRGs by grouping on DRG and summing total discharges, then sorting in descending order.

In [14]:
common_drg = data.groupby('DRG')
common_drg.agg({'Total_Discharges': 'sum'}).sort_values(by='Total_Discharges', ascending=False)[:20]
Out[14]:
Total_Discharges
DRG
470 - MAJOR JOINT REPLACEMENT OR REATTACHMENT OF LOWER EXTREMITY W/O MCC 427207
871 - SEPTICEMIA OR SEVERE SEPSIS W/O MV 96+ HOURS W MCC 319072
392 - ESOPHAGITIS, GASTROENT & MISC DIGEST DISORDERS W/O MCC 244854
292 - HEART FAILURE & SHOCK W CC 222038
690 - KIDNEY & URINARY TRACT INFECTIONS W/O MCC 206695
194 - SIMPLE PNEUMONIA & PLEURISY W CC 198390
291 - HEART FAILURE & SHOCK W MCC 185599
641 - MISC DISORDERS OF NUTRITION,METABOLISM,FLUIDS/ELECTROLYTES W/O MCC 153660
683 - RENAL FAILURE W CC 150444
190 - CHRONIC OBSTRUCTIVE PULMONARY DISEASE W MCC 149677
191 - CHRONIC OBSTRUCTIVE PULMONARY DISEASE W CC 148491
312 - SYNCOPE & COLLAPSE 141918
603 - CELLULITIS W/O MCC 140894
378 - G.I. HEMORRHAGE W CC 138678
313 - CHEST PAIN 131079
193 - SIMPLE PNEUMONIA & PLEURISY W MCC 127989
287 - CIRCULATORY DISORDERS EXCEPT AMI, W CARD CATH W/O MCC 115920
192 - CHRONIC OBSTRUCTIVE PULMONARY DISEASE W/O CC/MCC 114790
310 - CARDIAC ARRHYTHMIA & CONDUCTION DISORDERS W/O CC/MCC 113454
872 - SEPTICEMIA OR SEVERE SEPSIS W/O MV 96+ HOURS W/O MCC 112430
In [15]:
drg470 = data[data['DRG'].str.contains('470')]

The most common procedure performed by hospitals was 470 - MAJOR JOINT REPLACEMENT OR REATTACHMENT OF LOWER EXTREMITY W/O MCC. In the following, I create a scatter plot to take a deeper look at the average Medicare payments for this group.

In [16]:
# Create scatter plot using Plotly Express with color mapped to 'Average_Covered_Charges'
fig = px.scatter(drg470, x='Provider_State', y='Average_Medicare_Payments',
                 color='Average_Medicare_Payments', 
                 color_continuous_scale='PuBuGn',
                 title='Average Medicare Payments for DRG 470 by State',
                 labels={'Provider_State': 'State', 'Average_Medicare_Payments': 'Average Medicare Payments'})

# Customize layout
fig.update_layout(
    xaxis_title='State', 
    yaxis_title='Average Medicare Payments',
    template='plotly_white'
)

# Show the figure
fig.show()

For DRG 470, the average Medicare payments mostly fall between 10k and 20k, with some states (e.g. Maryland, California, New York) exceeding 25k or even 30k.

Next, I'll look at how the payments for each DRG are distributed.

In [17]:
def stat(data, group_var, cont_var):
    # groupby().describe() returns each group's statistics already aligned with
    # its group label; rebuilding the index from drop_duplicates() (appearance
    # order) mislabels rows, since groupby returns groups in sorted order
    return data.groupby(group_var)[cont_var].describe()
In [18]:
stat(data,'DRG','Average_Total_Payments').head(n=10)
Out[18]:
count mean std min 25% 50% 75% max
039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 1079.0 6960.534004 1477.873952 4968.00 6001.8300 6582.890 7516.8250 18420.56
057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/O MCC 1201.0 6706.276445 2033.965862 4194.09 5412.8700 6093.750 7345.3600 25519.43
064 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFARCTION W MCC 1659.0 13263.823032 3847.918207 8174.28 10762.2200 12084.700 14424.3250 50882.40
065 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFARCTION W CC 2269.0 7922.671141 2084.658336 5368.73 6626.2700 7280.050 8503.0600 26510.15
066 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFARCTION W/O CC/MCC 1806.0 5713.985221 1342.538675 3916.41 4819.3250 5326.025 6197.4800 14744.05
069 - TRANSIENT ISCHEMIA 1962.0 5068.644292 1114.407934 3514.41 4320.7375 4740.235 5459.7375 12312.80
074 - CRANIAL & PERIPHERAL NERVE DISORDERS W/O MCC 979.0 6386.793177 1597.733831 4344.17 5278.4200 5903.250 7060.7800 15759.14
101 - SEIZURES W/O MCC 1593.0 5493.361883 1420.448742 3704.89 4530.4100 5058.160 6013.8600 15554.42
149 - DYSEQUILIBRIUM 988.0 4589.104342 1082.833515 3185.77 3826.4325 4285.315 5036.1925 9325.21
176 - PULMONARY EMBOLISM W/O MCC 1396.0 7279.954979 1544.880403 4891.30 6238.9800 6848.800 7933.6125 15783.05
In [19]:
stat(data,'DRG','Average_Medicare_Payments').head(n=10)
Out[19]:
count mean std min 25% 50% 75% max
039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 1079.0 5555.837525 1236.063439 3592.85 4759.8000 5269.280 6033.6100 15855.18
057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/O MCC 1201.0 5701.676570 1973.620083 3116.42 4467.6600 5104.780 6293.6000 22873.49
064 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFARCTION W MCC 1659.0 12112.849445 3702.047762 6603.63 9765.6300 10956.160 13094.6300 48632.28
065 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFARCTION W CC 2269.0 6781.501785 1979.482762 4109.25 5555.9100 6167.510 7342.0300 23402.26
066 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFARCTION W/O CC/MCC 1806.0 4469.203560 1271.852222 2771.88 3653.9750 4079.575 4862.4650 13710.23
069 - TRANSIENT ISCHEMIA 1962.0 3952.275102 1094.682773 2348.07 3240.5050 3627.465 4298.7650 9959.58
074 - CRANIAL & PERIPHERAL NERVE DISORDERS W/O MCC 979.0 5238.579438 1542.402076 2456.83 4201.5650 4815.660 5862.4700 13310.28
101 - SEIZURES W/O MCC 1593.0 4452.600709 1379.483972 2629.69 3535.5800 4012.660 4966.9500 13961.18
149 - DYSEQUILIBRIUM 988.0 3440.677065 1076.033108 2012.35 2717.5875 3145.555 3805.2575 8615.31
176 - PULMONARY EMBOLISM W/O MCC 1396.0 6077.948660 1505.418988 3561.68 5122.6200 5644.930 6554.8175 15161.27

2.1.4 Distributions of Charges and Payments by State (Choropleth Maps) ¶

In [20]:
def plot_map(data, group_var, cont_var, stat='mean', color='RdBu_r', label=''):
    # Grouping the Data
    if stat == 'count':
        x = data.groupby(group_var).size()  # Count number of rows per group
    else:
        x = data.groupby(group_var)[cont_var].agg(stat)  # Compute specified statistic
    
    # Plot choropleth map
    fig = go.Figure(go.Choropleth(
        locations=list(x.index), 
        z=x.astype(float),       
        locationmode='USA-states',
        colorscale=color,
        colorbar_title=f'{stat.capitalize()} {label.lower()}',
    ))
    
    fig.update_layout(
        title_text=f'{cont_var.replace("_", " ").title()} Per State',
        geo_scope='usa',
        height=800, width=950
    )
    
    return fig
In [21]:
# Plot the distribution of variables by states using choropleth map
p = [
    plot_map(data, "Provider_State", "Total_Discharges", stat='count'),
    plot_map(data, "Provider_State", "Average_Covered_Charges", stat='mean', label="Covered Charges"),
    plot_map(data, "Provider_State", "Average_Total_Payments", stat='mean', label="Total Payments"),
    plot_map(data, "Provider_State", "Average_Medicare_Payments", stat='mean', label="Medicare Payments")
]

for fig in p:
    fig.show()
     

Section 3: Feature Engineering ¶

3.1 Average Medicare Payments ¶

For this section, benchmarks are calculated by grouping the dataset by DRG, provider state, city, and zip code, then taking the mean of the average Medicare payments.

Four features are created in this section: Average Medicare Payments by DRG, by State, by State & City, and by Zip Code. They provide insight into healthcare payment patterns at various levels of granularity.
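Since the same groupby-mean, merge, ratio pattern repeats for every grouping level in this section, it can be factored into one helper. This is a sketch with a hypothetical name (`add_benchmark`), mirroring the merge/np.where cells below rather than reproducing them exactly:

```python
import numpy as np
import pandas as pd

def add_benchmark(df, keys, value_col, suffix):
    """Attach the group mean of `value_col` over `keys` plus the row/benchmark ratio."""
    bench_col = f"{value_col}_by{suffix}"
    bench = df.groupby(keys)[value_col].mean().rename(bench_col).reset_index()
    out = df.merge(bench, how="left", on=keys)
    # Guard against division by zero, as in the cells below
    out[f"{bench_col}_Ratio"] = np.where(
        out[bench_col] == 0, 0, out[value_col] / out[bench_col]
    )
    return out

# Toy usage mirroring 3.1.1
toy = pd.DataFrame({
    "DRG": ["039", "039", "057"],
    "Average_Medicare_Payments": [4000.0, 6000.0, 5000.0],
})
toy = add_benchmark(toy, ["DRG"], "Average_Medicare_Payments", "DRG")
print(toy["Average_Medicare_Payments_byDRG_Ratio"].tolist())  # [0.8, 1.2, 1.0]
```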

3.1.1 By DRG ¶

In [22]:
table_1 = data.groupby(['DRG'])['Average_Medicare_Payments'].mean().reset_index()
table_1 = table_1.rename(columns={'Average_Medicare_Payments': 'Average_Medicare_Payments_byDRG'})
table_1.head(10)
Out[22]:
DRG Average_Medicare_Payments_byDRG
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 5555.837525
1 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... 5701.676570
2 064 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFA... 12112.849445
3 065 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFA... 6781.501785
4 066 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFA... 4469.203560
5 069 - TRANSIENT ISCHEMIA 3952.275102
6 074 - CRANIAL & PERIPHERAL NERVE DISORDERS W/O... 5238.579438
7 101 - SEIZURES W/O MCC 4452.600709
8 149 - DYSEQUILIBRIUM 3440.677065
9 176 - PULMONARY EMBOLISM W/O MCC 6077.948660

To better identify outliers, additional quantiles (90%, 95%) are added using the following function:¶

In [23]:
def calculate_descriptive_stats(data, column):

    '''
    This function calculates descriptive statistics with additional quantiles (90% and 95%).
    '''
    
    # Calculate basic descriptive statistics
    desc_stats = data[column].describe()

    # Calculate the 90th and 95th percentiles
    perc_90 = data[column].quantile(0.90)
    perc_95 = data[column].quantile(0.95)

    # Convert the descriptive statistics to a DataFrame
    desc_stats = desc_stats.to_frame().transpose()

    # Add the 90th and 95th percentiles to the DataFrame
    desc_stats['90%'] = perc_90
    desc_stats['95%'] = perc_95

    # Reorder the columns to match the desired format
    desc_stats = desc_stats[['mean', 'min', '25%', '50%', '75%', '90%', '95%', 'max']]

    # Transpose the DataFrame
    desc_stats = desc_stats.T

    return desc_stats
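For reference, pandas' describe() can compute the extra quantiles in a single call through its percentiles argument, yielding the same numbers without inserting columns afterwards. A minimal sketch on toy data:

```python
import pandas as pd

s = pd.Series(range(1, 101), dtype=float)

# describe() accepts arbitrary percentiles (the median is always included)
stats = s.describe(percentiles=[0.25, 0.5, 0.75, 0.90, 0.95])
print(round(stats["95%"], 2))  # 95.05
```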
In [24]:
table_1_stats = calculate_descriptive_stats(table_1, 'Average_Medicare_Payments_byDRG')
table_1_stats
Out[24]:
Average_Medicare_Payments_byDRG
mean 9248.575795
min 2876.015593
25% 4465.052848
50% 6391.828157
75% 11444.865846
90% 17047.411069
95% 21501.661996
max 41899.432929

Based on the table above, payments that exceed the 95th percentile could be considered anomalies.
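That rule is straightforward to apply: flag any group mean above the 95th percentile. A sketch on a toy benchmark table standing in for `table_1`:

```python
import pandas as pd

# Toy benchmark table standing in for table_1
table_1 = pd.DataFrame({
    "DRG": ["A", "B", "C", "D"],
    "Average_Medicare_Payments_byDRG": [4000.0, 6000.0, 8000.0, 40000.0],
})

# Flag group means above the 95th percentile of the benchmark column
cutoff = table_1["Average_Medicare_Payments_byDRG"].quantile(0.95)
flagged = table_1[table_1["Average_Medicare_Payments_byDRG"] > cutoff]
print(flagged["DRG"].tolist())  # ['D']
```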

In [25]:
# Add average Medicare payments by DRG to the dataset
df1 = pd.merge(data, table_1, how='left', on=['DRG'])

# Calculate the ratio of the Average medicare payments to the Average medicare payments by DRG
df1['Average_Medicare_byDRG_Ratio'] = np.where(df1['Average_Medicare_Payments_byDRG']==0,0, df1['Average_Medicare_Payments'] / df1['Average_Medicare_Payments_byDRG'])
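An alternative to the ratio feature that also accounts for the spread within each DRG is a per-group z-score. This is only a sketch and is not used elsewhere in this notebook:

```python
import pandas as pd

toy = pd.DataFrame({
    "DRG": ["039", "039", "039", "057", "057", "057"],
    "Average_Medicare_Payments": [4.0, 5.0, 6.0, 10.0, 20.0, 30.0],
})

# Standardize each payment within its own DRG: (x - group mean) / group std
g = toy.groupby("DRG")["Average_Medicare_Payments"]
toy["zscore"] = (toy["Average_Medicare_Payments"] - g.transform("mean")) / g.transform("std")
print(toy["zscore"].round(2).tolist())  # [-1.0, 0.0, 1.0, -1.0, 0.0, 1.0]
```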

Distribution of Average Medicare Payments By DRG Ratio¶

In [26]:
# Define variables and quantile range
var = 'Average_Medicare_byDRG_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]

# Create the binned variable
df1[binned_var] = pd.qcut(df1[var], percentile)

# Create the bin labels
df1['bin_label'] = df1[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')

# Count the occurrences in each bin
bin_counts = df1['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']

# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
             color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Plotly3)

# Update the layout
fig.update_layout(
    title='Distribution of Average Medicare Payments By DRG Ratio',
    xaxis_title='Quantile Range',
    yaxis_title='Count',
    template = 'simple_white'
)

# Show the plot
fig.show()

This feature focuses on understanding how Medicare payments vary across different DRGs. It helps in identifying anomalies such as DRGs with unusually high or low Medicare payments compared to the overall average.

3.1.2 Per DRG by State ¶

In [27]:
table_2 = data.groupby(['DRG', 'Provider_State'])['Average_Medicare_Payments'].mean().reset_index()
table_2.columns = ['DRG', 'Provider_State', 'Average_Medicare_Bystates']
table_2.head(10)
Out[27]:
DRG Provider_State Average_Medicare_Bystates
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC AK 6413.780000
1 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC AL 4599.593043
2 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC AR 4938.712500
3 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC AZ 5912.832917
4 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC CA 6878.954925
5 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC CO 5419.930000
6 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC CT 6709.101333
7 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC DC 8110.816667
8 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC DE 5719.316667
9 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC FL 4868.594568
In [28]:
table_2_stats = calculate_descriptive_stats(table_2, 'Average_Medicare_Bystates')
table_2_stats
Out[28]:
Average_Medicare_Bystates
mean 9289.670980
min 2216.372500
25% 4483.274762
50% 6635.284161
75% 11044.008750
90% 18103.081589
95% 24208.021478
max 63646.045000
In [29]:
# Add average medicare payments by states to the dataset 
df2 = pd.merge(data, table_2, how='left', on=['DRG', 'Provider_State'])

# Calculate the ratio of the Average medicare payments to the Average medicare payments by states
df2['Average_Medicare_Bystates_Ratio'] = np.where(df2['Average_Medicare_Bystates']==0,0, df2['Average_Medicare_Payments'] / df2['Average_Medicare_Bystates'])
df2['Average_Medicare_Bystates_Ratio'].describe()
Out[29]:
count    163065.000000
mean          1.000000
std           0.215091
min           0.339831
25%           0.868395
50%           0.954250
75%           1.071483
max           5.174415
Name: Average_Medicare_Bystates_Ratio, dtype: float64

Distribution of Average Medicare Payments Per DRG By State Ratio¶

In [30]:
# Define variables and quantile range
var = 'Average_Medicare_Bystates_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]

# Create the binned variable
df2[binned_var] = pd.qcut(df2[var], percentile)

# Create the bin labels
df2['bin_label'] = df2[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')

# Count the occurrences in each bin
bin_counts = df2['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']

# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
             color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Plotly3)

# Update the layout
fig.update_layout(
    title='Distribution of Average Medicare Payments Per DRG By State Ratio',
    xaxis_title='Quantile Range',
    yaxis_title='Count',
    template = 'simple_white'
)

# Show the plot
fig.show()

This feature adds state-level granularity to the analysis, allowing regional disparities in Medicare payments to be identified for each DRG. It helps explain variations in healthcare costs influenced by state-level policies, demographics, and healthcare infrastructure. Payments falling within the highest range, i.e. [1.74-5.17], could be flagged as anomalies.

3.1.3 Per DRG by State & City ¶

In [31]:
table_3 = data.groupby(['DRG', 'Provider_State', 'Provider_City'])['Average_Medicare_Payments'].mean().reset_index()
table_3 = table_3.rename(columns={'Average_Medicare_Payments': 'Average_Medicare_Bystatescity'}).sort_values(by='Average_Medicare_Bystatescity', ascending=False)

table_3.head(10)
Out[31]:
DRG Provider_State Provider_City Average_Medicare_Bystatescity
2126950 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... CA STANFORD 154620.81
9244283 853 - INFECTIOUS & PARASITIC DISEASES W O.R. P... NY VALHALLA 133177.26
2125896 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... CA FREMONT 113462.09
9285667 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS CA STANFORD 109303.21
9345110 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS NY VALHALLA 107456.18
5312030 329 - MAJOR SMALL & LARGE BOWEL PROCEDURES W MCC NY VALHALLA 100040.26
6831303 469 - MAJOR JOINT REPLACEMENT OR REATTACHMENT ... PA HERSHEY 99114.86
2186393 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... NY VALHALLA 98792.46
9184840 853 - INFECTIOUS & PARASITIC DISEASES W O.R. P... CA STANFORD 94614.03
9361733 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS TX GALVESTON 89857.83
In [32]:
table_3_stats = calculate_descriptive_stats(table_3, 'Average_Medicare_Bystatescity')
table_3_stats
Out[32]:
Average_Medicare_Bystatescity
mean 8133.157995
min 1148.900000
25% 4070.555000
50% 5925.720000
75% 9701.325000
90% 14810.964000
95% 20455.067000
max 154620.810000
In [33]:
# Add average Medicare payments by state & city to the dataset
df3 = pd.merge(data, table_3, how='left', on=['DRG', 'Provider_State', 'Provider_City'])

# Calculate the ratio of the average Medicare payments to the average Medicare payments by state & city
df3['Average_Medicare_Bystatescity_Ratio'] = np.where(df3['Average_Medicare_Bystatescity']==0,0, df3['Average_Medicare_Payments'] / df3['Average_Medicare_Bystatescity'])

Distribution of Average Medicare Payments By State and City Ratio¶

In [34]:
# Define variables and quantile range
var = 'Average_Medicare_Bystatescity_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]

# Create the binned variable
df3[binned_var] = pd.qcut(df3[var], percentile, duplicates='drop')

# Create the bin labels
df3['bin_label'] = df3[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')

# Count the occurrences in each bin
bin_counts = df3['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']

# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
             color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Plotly3)

# Update the layout
fig.update_layout(
    title='Distribution of Average Medicare Payments By State and City Ratio',
    xaxis_title='Quantile Range',
    yaxis_title='Count',
    template = 'simple_white'
)

# Show the plot
fig.show()
In [35]:
# Calculate 95th percentile threshold
threshold = np.percentile(df3[var], 95)

# Create anomaly indicator
df3['Anomaly'] = np.where(df3[var] > threshold, 1, 0)

# Count anomalies
anomaly_counts = df3['Anomaly'].value_counts()
anomaly_counts
Out[35]:
Anomaly
0    154911
1      8154
Name: count, dtype: int64

This feature disaggregates further by city within each state, providing insight into localized healthcare payment patterns. It identifies cities where Medicare payments differ markedly across DRGs, indicating potential anomalies in healthcare service delivery or cost structures at the city level.
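As a sanity check, a strict 95th-percentile cutoff flags roughly 5% of rows by construction (8,154 of 163,065 ≈ 5.0% above). The same pattern can be verified on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
ratios = rng.lognormal(size=10_000)

# Flag values strictly above the 95th percentile; ~5% should be flagged
threshold = np.percentile(ratios, 95)
flag_rate = (ratios > threshold).mean()
print(round(flag_rate, 3))  # 0.05
```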

3.1.4 Average Medicare Payments per DRG by Zip Code ¶

In [36]:
table_4 = data.groupby(['DRG', 'Provider_Zipcode'])['Average_Medicare_Payments'].mean().reset_index()
table_4 = table_4.rename(columns={'Average_Medicare_Payments': 'Average_Medicare_Byzip'}).sort_values(by='Average_Medicare_Byzip', ascending=False)

table_4.head(10)
Out[36]:
DRG Provider_Zipcode Average_Medicare_Byzip
66990 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... 94305 154620.81
278041 853 - INFECTIOUS & PARASITIC DISEASES W O.R. P... 10595 133177.26
66996 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... 94538 113462.09
283753 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS 94305 109303.21
281094 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS 10595 107456.18
283585 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS 90095 101717.82
158974 329 - MAJOR SMALL & LARGE BOWEL PROCEDURES W MCC 10595 100040.26
204951 469 - MAJOR JOINT REPLACEMENT OR REATTACHMENT ... 17033 99114.86
64331 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... 10595 98792.46
278317 853 - INFECTIOUS & PARASITIC DISEASES W O.R. P... 20060 95701.42
In [37]:
table_4_stats = calculate_descriptive_stats(table_4, 'Average_Medicare_Byzip')
table_4_stats
Out[37]:
Average_Medicare_Byzip
mean 8455.257017
min 1148.900000
25% 4172.160000
50% 6125.210000
75% 10014.280000
90% 15625.386000
95% 21653.620000
max 154620.810000
In [38]:
# Add average Medicare payments by zip code to the dataset
df4 = pd.merge(data, table_4, how='left', on=['DRG', 'Provider_Zipcode'])

# Calculate the ratio of the Average Medicare payments to the average Medicare payments by zip code
df4['Average_Medicare_Byzip_Ratio'] = np.where(df4['Average_Medicare_Byzip']==0,0, df4['Average_Medicare_Payments'] / df4['Average_Medicare_Byzip'])

Distribution of Average Medicare Payments By Zip Code Ratio¶

In [39]:
# Define variables and quantile range
var = 'Average_Medicare_Byzip_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]

# Create the binned variable
df4[binned_var] = pd.qcut(df4[var], percentile, duplicates='drop')

# Create the bin labels
df4['bin_label'] = df4[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')

# Count the occurrences in each bin
bin_counts = df4['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']

# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
             color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Plotly3)

# Update the layout
fig.update_layout(
    title='Distribution of Average Medicare Payments By Zip Code Ratio',
    xaxis_title='Quantile Range',
    yaxis_title='Count',
    template = 'simple_white'
)

# Show the plot
fig.show()
In [40]:
# Calculate 95th percentile threshold
threshold = np.percentile(df4[var], 95)

# Create anomaly indicator
df4['Anomaly'] = np.where(df4[var] > threshold, 1, 0)

# Count anomalies
anomaly_counts = df4['Anomaly'].value_counts()
anomaly_counts
Out[40]:
Anomaly
0    155070
1      7995
Name: count, dtype: int64

The anomaly counts are mainly computed at the 'by state & city' and 'by zip code' levels because those two levels provide more in-depth insights.

This feature's zip-code-level granularity offers the most detailed view of Medicare payment variation within specific cities. It identifies zip codes where Medicare payments deviate significantly within the same city and state, pinpointing localized anomalies or disparities in healthcare utilization and costs.
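The 95th-percentile flagging repeated for each ratio above can be wrapped in a small reusable helper. This is an illustrative sketch: `flag_above_percentile` and the toy frame are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def flag_above_percentile(df, var, pct=95):
    """Return a 0/1 indicator marking rows of `var` above the
    given percentile, mirroring the thresholding used above."""
    threshold = np.percentile(df[var].dropna(), pct)
    return np.where(df[var] > threshold, 1, 0)

# Toy ratios: four near-benchmark values and one extreme outlier
toy = pd.DataFrame({'ratio': [0.9, 1.0, 1.1, 1.0, 5.0]})
toy['Anomaly'] = flag_above_percentile(toy, 'ratio', pct=95)
print(toy['Anomaly'].tolist())  # [0, 0, 0, 0, 1]
```

Note that a fixed-percentile cutoff always flags roughly the same share of rows (here ~5%), which explains why the anomaly counts above are nearly identical across features.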

3.2 Average Total Payments ¶

For this section, benchmarks are calculated by grouping on DRG together with provider state, city, and zip code, then taking the mean of the Average Total Payments.

Three features are created in this section: Average Total Payments per DRG by State, by State & City, and by Zip Code. Together they provide comprehensive insight into healthcare expenditure patterns at several levels of granularity.
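Each of these features follows the same groupby → merge → ratio pattern. It can be sketched as one reusable helper (illustrative only; `benchmark_ratio` and the toy rows are assumptions, not code from the assignment):

```python
import numpy as np
import pandas as pd

def benchmark_ratio(data, group_cols, value_col, bench_col):
    """Add a group-mean benchmark of `value_col` and the per-row
    ratio against it, following the pattern used in this section."""
    bench = (data.groupby(group_cols)[value_col].mean()
                 .reset_index()
                 .rename(columns={value_col: bench_col}))
    out = data.merge(bench, how='left', on=group_cols)
    # Guard against a zero benchmark before dividing
    out[bench_col + '_Ratio'] = np.where(out[bench_col] == 0, 0,
                                         out[value_col] / out[bench_col])
    return out

# Toy example: one DRG, two states with two providers each
toy = pd.DataFrame({
    'DRG': ['039'] * 4,
    'Provider_State': ['AL', 'AL', 'CA', 'CA'],
    'Average_Total_Payments': [4000.0, 6000.0, 8000.0, 12000.0],
})
res = benchmark_ratio(toy, ['DRG', 'Provider_State'],
                      'Average_Total_Payments', 'Average_Total_Bystate')
print(res['Average_Total_Bystate_Ratio'].tolist())  # [0.8, 1.2, 0.8, 1.2]
```

A ratio near 1 means a provider charges close to its peer-group benchmark; values well above 1 are the candidates flagged as anomalies below.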

3.2.1 Per DRG by State ¶

In [41]:
table_5 = data.groupby(['DRG', 'Provider_State'])['Average_Total_Payments'].mean().reset_index()
table_5 = table_5.rename(columns={'Average_Total_Payments': 'Average_Total_Bystate'}).sort_values(by='Average_Total_Bystate', ascending=False)

table_5.head(10)
Out[41]:
DRG Provider_State Average_Total_Bystate
2652 329 - MAJOR SMALL & LARGE BOWEL PROCEDURES W MCC AK 68006.425000
4648 853 - INFECTIOUS & PARASITIC DISEASES W O.R. P... DC 66437.692500
4692 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS AK 61300.630000
4641 853 - INFECTIOUS & PARASITIC DISEASES W O.R. P... AK 57238.680000
4652 853 - INFECTIOUS & PARASITIC DISEASES W O.R. P... HI 56482.922000
4726 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS NY 56465.890641
1071 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... AK 54385.475000
4696 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS CA 53733.333710
4738 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS VT 52761.330000
4699 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS DC 52491.154000
In [42]:
table_5_stats = calculate_descriptive_stats(table_5, 'Average_Total_Bystate')
table_5_stats
Out[42]:
Average_Total_Bystate
mean 10640.517669
min 3236.026333
25% 5583.888889
50% 7780.726250
75% 12493.163000
90% 19845.450487
95% 27083.356325
max 68006.425000
In [43]:
# Add average total payments by state to the dataset
df5 = pd.merge(data, table_5, how='left', on=['DRG', 'Provider_State'])

# Calculate the ratio of the Average Total payments to the average Total payments by state
df5['Average_Total_Bystates_Ratio'] = np.where(df5['Average_Total_Bystate']==0,0, df5['Average_Total_Payments'] / df5['Average_Total_Bystate'])

Distribution of Average Total Payments By State Ratio¶

In [44]:
# Define variables and quantile range
var = 'Average_Total_Bystates_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]

# Create the binned variable
df5[binned_var] = pd.qcut(df5[var], percentile)

# Create the bin labels
df5['bin_label'] = df5[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')

# Count the occurrences in each bin
bin_counts = df5['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']

# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
             color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.dense)

# Update the layout
fig.update_layout(
    title='Distribution of Average Total Payments By State Ratio',
    xaxis_title='Quantile Range',
    yaxis_title='Count',
    template = 'simple_white'
)

# Show the plot
fig.show()

This feature adds state-level granularity to the analysis, identifying regional variations in total payments for each DRG.

3.2.2 Per DRG by State & City ¶

In [45]:
table_6 = data.groupby(['DRG', 'Provider_State', 'Provider_City'])['Average_Total_Payments'].mean().reset_index()
table_6 = table_6.rename(columns={'Average_Total_Payments': 'Average_Total_Bystatescity'}).sort_values(by='Average_Total_Bystatescity', ascending=False)

table_6.head(10)
Out[45]:
DRG Provider_State Provider_City Average_Total_Bystatescity
2126950 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... CA STANFORD 156158.18
9244283 853 - INFECTIOUS & PARASITIC DISEASES W O.R. P... NY VALHALLA 140255.26
2125896 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... CA FREMONT 119113.00
9345110 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS NY VALHALLA 119028.90
9285667 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS CA STANFORD 109945.57
5312030 329 - MAJOR SMALL & LARGE BOWEL PROCEDURES W MCC NY VALHALLA 101796.26
6831303 469 - MAJOR JOINT REPLACEMENT OR REATTACHMENT ... PA HERSHEY 100018.33
2186393 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... NY VALHALLA 99929.46
9184840 853 - INFECTIOUS & PARASITIC DISEASES W O.R. P... CA STANFORD 95302.74
9361733 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS TX GALVESTON 90701.58
In [46]:
table_6_stats = calculate_descriptive_stats(table_6, 'Average_Total_Bystatescity')
table_6_stats
Out[46]:
Average_Total_Bystatescity
mean 9316.580074
min 2673.000000
25% 5105.865000
50% 6969.000000
75% 10860.125000
90% 16328.638000
95% 22435.810500
max 156158.180000
In [47]:
# Add average total payments by state & city to the dataset
df6 = pd.merge(data, table_6, how='left', on=['DRG', 'Provider_State', 'Provider_City'])

# Calculate the ratio of the Average Total payments to the average Total payments by state & city
df6['Average_Total_Bystatescity_Ratio'] = np.where(df6['Average_Total_Bystatescity']==0,0, df6['Average_Total_Payments'] / df6['Average_Total_Bystatescity'])

Distribution of Average Total Payments By State and City Ratio¶

In [48]:
# Define variables and quantile range
var = 'Average_Total_Bystatescity_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]

# Create the binned variable
df6[binned_var] = pd.qcut(df6[var], percentile, duplicates='drop')

# Create the bin labels
df6['bin_label'] = df6[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')

# Count the occurrences in each bin
bin_counts = df6['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']

# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
             color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.dense)

# Update the layout
fig.update_layout(
    title='Distribution of Average Total Payments By State and City Ratio',
    xaxis_title='Quantile Range',
    yaxis_title='Count',
    template = 'simple_white'
)

# Show the plot
fig.show()
In [49]:
# Calculate 95th percentile threshold
threshold = np.percentile(df6[var], 95)

# Create anomaly indicator
df6['Anomaly'] = np.where(df6[var] > threshold, 1, 0)

# Count anomalies
anomaly_counts = df6['Anomaly'].value_counts()
anomaly_counts
Out[49]:
Anomaly
0    154911
1      8154
Name: count, dtype: int64

Similar to Average Medicare Payments, this feature adds an additional level of granularity for detecting anomalies.

3.2.3 Per DRG by Zip Code ¶

In [50]:
table_7 = data.groupby(['DRG', 'Provider_Zipcode'])['Average_Total_Payments'].mean().reset_index()
table_7 = table_7.rename(columns={'Average_Total_Payments': 'Average_Total_Byzip'}).sort_values(by='Average_Total_Byzip', ascending=False)

table_7.head(10)
Out[50]:
DRG Provider_Zipcode Average_Total_Byzip
66990 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... 94305 156158.18
278041 853 - INFECTIOUS & PARASITIC DISEASES W O.R. P... 10595 140255.26
66996 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... 94538 119113.00
281094 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS 10595 119028.90
283753 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS 94305 109945.57
283585 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS 90095 103802.96
158974 329 - MAJOR SMALL & LARGE BOWEL PROCEDURES W MCC 10595 101796.26
204951 469 - MAJOR JOINT REPLACEMENT OR REATTACHMENT ... 17033 100018.33
64331 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... 10595 99929.46
159489 329 - MAJOR SMALL & LARGE BOWEL PROCEDURES W MCC 29220 99307.30
In [51]:
table_7_stats = calculate_descriptive_stats(table_7, 'Average_Total_Byzip')
table_7_stats
Out[51]:
Average_Total_Byzip
mean 9665.549998
min 2682.640000
25% 5215.000000
50% 7178.760000
75% 11233.000000
90% 17190.889000
95% 23786.226000
max 156158.180000
In [52]:
# Add average total payments by zip code to the dataset
df7 = pd.merge(data, table_7, how='left', on=['DRG', 'Provider_Zipcode'])

# Calculate the ratio of the Average Total payments to the average Total payments by zip code
df7['Average_Total_Byzip_Ratio'] = np.where(df7['Average_Total_Byzip']==0,0, df7['Average_Total_Payments'] / df7['Average_Total_Byzip'])

Distribution of Average Total Payments By Zip Code Ratio¶

In [53]:
# Define variables and quantile range
var = 'Average_Total_Byzip_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]

# Create the binned variable
df7[binned_var] = pd.qcut(df7[var], percentile, duplicates='drop')

# Create the bin labels
df7['bin_label'] = df7[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')

# Count the occurrences in each bin
bin_counts = df7['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']

# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
             color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.dense)

# Update the layout
fig.update_layout(
    title='Distribution of Average Total Payments By Zip Code Ratio',
    xaxis_title='Quantile Range',
    yaxis_title='Count',
    template = 'simple_white'
)

# Show the plot
fig.show()
In [54]:
# Calculate 95th percentile threshold
threshold = np.percentile(df7[var], 95)

# Create anomaly indicator
df7['Anomaly'] = np.where(df7[var] > threshold, 1, 0)

# Count anomalies
anomaly_counts = df7['Anomaly'].value_counts()
anomaly_counts
Out[54]:
Anomaly
0    155056
1      8009
Name: count, dtype: int64

Similar to Average Medicare Payments, this feature adds an additional level of granularity.

This Average Total Payments section covers all costs associated with healthcare services, offering a broader perspective on anomalies related to overall service costs, resource use, and patient payments.

3.3 Average Out of Pocket Payments ¶

For this section, the Average Out-Of-Pocket (OOP) Payment is computed as Total Payments minus Medicare Payments. Benchmarks are then calculated by grouping on DRG together with provider state, city, and zip code, and taking the mean of the Average OOP Payments.

Three features are created in this section: Average OOP Payments per DRG by State, by State & City, and by Zip Code. They provide insight into healthcare expenditure patterns at several levels of granularity from the patient's perspective. A hospital that charges a higher-than-average OOP cost could be considered an anomaly.
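The OOP derivation (Total Payments minus Medicare Payments) can be checked on a couple of rows; the values below are taken from the dataset's first records, and the toy frame itself is illustrative:

```python
import pandas as pd

# Two rows mirroring the dataset's payment columns
toy = pd.DataFrame({
    'Average_Total_Payments':    [5777.24, 5787.57],
    'Average_Medicare_Payments': [4763.73, 4976.71],
})
# OOP share = total payments less the Medicare portion
toy['Average_OOP'] = (toy['Average_Total_Payments']
                      - toy['Average_Medicare_Payments'])
print(toy['Average_OOP'].round(2).tolist())  # [1013.51, 810.86]
```

Strictly speaking this residual also contains third-party coordination-of-benefits payments, not only beneficiary co-payments and deductibles, so "OOP" here is an upper bound on what the patient actually pays.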

3.3.1 Per DRG by State ¶

In [55]:
data['Average_OOP'] = data['Average_Total_Payments'] - data['Average_Medicare_Payments']
data.head()
Out[55]:
DRG Provider_Id Provider_Name Provider_StreetAddress Provider_City Provider_State Provider_Zipcode Hospital_referral_region_desp Total_Discharges Average_Covered_Charges Average_Total_Payments Average_Medicare_Payments Average_OOP
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 91 32963.07 5777.24 4763.73 1013.51
1 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10005 MARSHALL MEDICAL CENTER SOUTH 2505 U S HIGHWAY 431 NORTH BOAZ AL 35957 AL - Birmingham 14 15131.85 5787.57 4976.71 810.86
2 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10006 ELIZA COFFEE MEMORIAL HOSPITAL 205 MARENGO STREET FLORENCE AL 35631 AL - Birmingham 24 37560.37 5434.95 4453.79 981.16
3 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10011 ST VINCENT'S EAST 50 MEDICAL PARK EAST DRIVE BIRMINGHAM AL 35235 AL - Birmingham 25 13998.28 5417.56 4129.16 1288.40
4 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10016 SHELBY BAPTIST MEDICAL CENTER 1000 FIRST STREET NORTH ALABASTER AL 35007 AL - Birmingham 18 31633.27 5658.33 4851.44 806.89
In [56]:
table_8 = data.groupby(['DRG', 'Provider_State'])['Average_OOP'].mean().reset_index()
table_8 = table_8.rename(columns={'Average_OOP': 'Average_OOP_Bystates'})
table_8.head(10)
Out[56]:
DRG Provider_State Average_OOP_Bystates
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC AK 1988.170000
1 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC AL 1144.018696
2 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC AR 1180.116250
3 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC AZ 1358.722917
4 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC CA 1508.977761
5 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC CO 1631.992000
6 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC CT 1301.466000
7 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC DC 1383.156667
8 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC DE 1408.896667
9 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC FL 1334.597778
In [57]:
table_8_stats = calculate_descriptive_stats(table_8, 'Average_OOP_Bystates')
table_8_stats
Out[57]:
Average_OOP_Bystates
mean 1350.846689
min 359.710000
25% 957.941455
50% 1105.953333
75% 1424.579545
90% 2109.421204
95% 2687.367642
max 19964.515000
In [58]:
# Add average OOP payments by state to the dataset
df8 = pd.merge(data, table_8, how='left', on=['DRG', 'Provider_State'])

# Calculate the ratio of the Average OOP payments to the average OOP payments by state
df8['Average_OOP_Bystates_Ratio'] = np.where(df8['Average_OOP_Bystates']==0,0, df8['Average_OOP'] / df8['Average_OOP_Bystates'])

Distribution of Average Out-Of-Pocket Payments By State Ratio¶

In [59]:
# Define variables and quantile range
var = 'Average_OOP_Bystates_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]

# Create the binned variable
df8[binned_var] = pd.qcut(df8[var], percentile)

# Create the bin labels
df8['bin_label'] = df8[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')

# Count the occurrences in each bin
bin_counts = df8['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']

# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
             color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Purpor_r)

# Update the layout
fig.update_layout(
    title='Distribution of Average Out-Of-Pocket Payments By State Ratio',
    xaxis_title='Quantile Range',
    yaxis_title='Count',
    template = 'simple_white'
)

# Show the plot
fig.show()

This feature provides insight into disparities in patient expenses across states. Payments falling into the high-ratio bins, i.e. [1.91, 3.20] and [3.20, 38.19], could be considered anomalies.

3.3.2 Per DRG by State & City ¶

In [60]:
table_9 = data.groupby(['DRG', 'Provider_State', 'Provider_City'])['Average_OOP'].mean().reset_index()
table_9 = table_9.rename(columns={'Average_OOP': 'Average_OOP_Bystatescity'}).sort_values(by='Average_OOP_Bystatescity', ascending=False)
table_9.head(10)
Out[60]:
DRG Provider_State Provider_City Average_OOP_Bystatescity
2916206 249 - PERC CARDIOVASC PROC W NON-DRUG-ELUTING ... WA BELLEVUE 75998.660
6748071 460 - SPINAL FUSION EXCEPT CERVICAL W/O MCC WA FEDERAL WAY 51457.650
3054632 252 - OTHER VASCULAR PROCEDURES W MCC IN BLOOMINGTON 41686.250
2054178 203 - BRONCHITIS & ASTHMA W/O CC/MCC MA ATTLEBORO 39691.910
2805011 247 - PERC CARDIOVASC PROC W DRUG-ELUTING STEN... SD SIOUX FALLS 37708.605
2682933 246 - PERC CARDIOVASC PROC W DRUG-ELUTING STEN... NJ BERLIN 37034.820
6723696 460 - SPINAL FUSION EXCEPT CERVICAL W/O MCC NY WEST ISLIP 35437.920
1286505 189 - PULMONARY EDEMA & RESPIRATORY FAILURE PA READING 33117.260
5329100 329 - MAJOR SMALL & LARGE BOWEL PROCEDURES W MCC TX MCKINNEY 32245.250
2551805 244 - PERMANENT CARDIAC PACEMAKER IMPLANT W/O ... IN RICHMOND 32146.450
In [61]:
table_9_stats = calculate_descriptive_stats(table_9, 'Average_OOP_Bystatescity')
table_9_stats
Out[61]:
Average_OOP_Bystatescity
mean 1183.422079
min 0.000000
25% 783.055000
50% 938.000000
75% 1204.592778
90% 1846.114000
95% 2579.456000
max 75998.660000
In [62]:
# Add average OOP payments by state & city to the dataset
df9 = pd.merge(data, table_9, how='left', on=['DRG', 'Provider_State', 'Provider_City'])

# Calculate the ratio of the Average OOP payments to the average OOP payments by state & city
df9['Average_OOP_Bystatescity_Ratio'] = np.where(df9['Average_OOP_Bystatescity']==0,0, df9['Average_OOP'] / df9['Average_OOP_Bystatescity'])

Distribution of Average Out-Of-Pocket Payments By State & City Ratio¶

In [63]:
# Define variables and quantile range
var = 'Average_OOP_Bystatescity_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]

# Create the binned variable
df9[binned_var] = pd.qcut(df9[var], percentile, duplicates='drop')

# Create the bin labels
df9['bin_label'] = df9[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')

# Count the occurrences in each bin
bin_counts = df9['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']

# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
             color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Purpor)

# Update the layout
fig.update_layout(
    title='Distribution of Average Out-Of-Pocket Payments By State & City Ratio',
    xaxis_title='Quantile Range',
    yaxis_title='Count',
    template = 'simple_white'
)

# Show the plot
fig.show()
In [64]:
# Calculate 95th percentile threshold
threshold = np.percentile(df9[var], 95)

# Create anomaly indicator
df9['Anomaly'] = np.where(df9[var] > threshold, 1, 0)

# Count anomalies
anomaly_counts = df9['Anomaly'].value_counts()
anomaly_counts
Out[64]:
Anomaly
0    154911
1      8154
Name: count, dtype: int64

Similar to the previous features, this adds city-level granularity. Payments falling into the higher bins could be flagged as anomalies.

3.3.3 Per DRG by Zip Code ¶

In [65]:
table_10 = data.groupby(['DRG', 'Provider_Zipcode'])['Average_OOP'].mean().reset_index()
table_10 = table_10.rename(columns={'Average_OOP': 'Average_OOP_Byzip'}).sort_values(by='Average_OOP_Byzip', ascending=False)
table_10.head(10)
Out[65]:
DRG Provider_Zipcode Average_OOP_Byzip
88487 249 - PERC CARDIOVASC PROC W NON-DRUG-ELUTING ... 98004 75998.66
84205 247 - PERC CARDIOVASC PROC W DRUG-ELUTING STEN... 57108 74168.03
37116 189 - PULMONARY EDEMA & RESPIRATORY FAILURE 19605 65079.84
204500 460 - SPINAL FUSION EXCEPT CERVICAL W/O MCC 98003 51457.65
65978 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... 60657 41849.50
93126 252 - OTHER VASCULAR PROCEDURES W MCC 47403 41686.25
61115 203 - BRONCHITIS & ASTHMA W/O CC/MCC 2703 39691.91
79549 246 - PERC CARDIOVASC PROC W DRUG-ELUTING STEN... 8009 37034.82
201759 460 - SPINAL FUSION EXCEPT CERVICAL W/O MCC 11795 35437.92
90311 251 - PERC CARDIOVASC PROC W/O CORONARY ARTERY... 57108 34721.65
In [66]:
table_10_stats = calculate_descriptive_stats(table_10, 'Average_OOP_Byzip')
table_10_stats
Out[66]:
Average_OOP_Byzip
mean 1210.292981
min 0.000000
25% 778.770000
50% 938.000000
75% 1222.000000
90% 1923.914000
95% 2715.580000
max 75998.660000
In [67]:
# Add average OOP payments by zip code to the dataset
df10 = pd.merge(data, table_10, how='left', on=['DRG', 'Provider_Zipcode'])

# Calculate the ratio of the Average OOP payments to the average OOP payments by zip code
df10['Average_OOP_Byzip_Ratio'] = np.where(df10['Average_OOP_Byzip']==0,0, df10['Average_OOP'] / df10['Average_OOP_Byzip'])

Distribution of Average Out-Of-Pocket Payments By Zip Code Ratio¶

In [68]:
# Define variables and quantile range
var = 'Average_OOP_Byzip_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]

# Create the binned variable
df10[binned_var] = pd.qcut(df10[var], percentile, duplicates='drop')

# Create the bin labels
df10['bin_label'] = df10[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')

# Count the occurrences in each bin
bin_counts = df10['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']

# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
             color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Purpor)

# Update the layout
fig.update_layout(
    title='Distribution of Average Out-Of-Pocket Payments By Zip Code Ratio',
    xaxis_title='Quantile Range',
    yaxis_title='Count',
    template = 'simple_white'
)

# Show the plot
fig.show()
In [69]:
# Calculate 95th percentile threshold
threshold = np.percentile(df10[var], 95)

# Create anomaly indicator
df10['Anomaly'] = np.where(df10[var] > threshold, 1, 0)

# Count anomalies
anomaly_counts = df10['Anomaly'].value_counts()
anomaly_counts
Out[69]:
Anomaly
0    155070
1      7995
Name: count, dtype: int64

Similar to the previous features, this adds a further level of granularity. Payments falling into the higher bins could be flagged as anomalies.

Data Preparation for Modeling¶

In [70]:
# Append the new features back to the original dataset for further analysis
# Make a copy of the original DataFrame
data_copy = data.copy()

# List of DataFrames
dfs = [df1, df2, df3, df4, df5, df6, df7, df8, df9, df10]

# Function to filter and merge only the 'Ratio' columns
def merge_ratio_columns(left, right):
    ratio_columns = [col for col in right.columns if col.endswith('Ratio')]
    return pd.merge(left, right[['DRG', 'Provider_Id', 'Provider_State', 'Provider_City', 'Provider_Zipcode'] + ratio_columns], 
                    on=['DRG', 'Provider_Id', 'Provider_State', 'Provider_City', 'Provider_Zipcode'], how='left')

# Use reduce to iteratively merge all DataFrames with the ratio columns into df_copy
final_df = reduce(merge_ratio_columns, dfs, data_copy)

# Display the final merged DataFrame
final_df.head()
Out[70]:
DRG Provider_Id Provider_Name Provider_StreetAddress Provider_City Provider_State Provider_Zipcode Hospital_referral_region_desp Total_Discharges Average_Covered_Charges ... Average_Medicare_byDRG_Ratio Average_Medicare_Bystates_Ratio Average_Medicare_Bystatescity_Ratio Average_Medicare_Byzip_Ratio Average_Total_Bystates_Ratio Average_Total_Bystatescity_Ratio Average_Total_Byzip_Ratio Average_OOP_Bystates_Ratio Average_OOP_Bystatescity_Ratio Average_OOP_Byzip_Ratio
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 91 32963.07 ... 0.857428 1.035685 1.038763 1.0 1.005855 1.037810 1.0 0.885921 1.033356 1.0
1 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10005 MARSHALL MEDICAL CENTER SOUTH 2505 U S HIGHWAY 431 NORTH BOAZ AL 35957 AL - Birmingham 14 15131.85 ... 0.895762 1.081989 1.000000 1.0 1.007653 1.000000 1.0 0.708782 1.000000 1.0
2 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10006 ELIZA COFFEE MEMORIAL HOSPITAL 205 MARENGO STREET FLORENCE AL 35631 AL - Birmingham 24 37560.37 ... 0.801642 0.968301 1.000000 1.0 0.946260 1.000000 1.0 0.857643 1.000000 1.0
3 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10011 ST VINCENT'S EAST 50 MEDICAL PARK EAST DRIVE BIRMINGHAM AL 35235 AL - Birmingham 25 13998.28 ... 0.743211 0.897723 0.899200 1.0 0.943232 0.915208 1.0 1.126205 0.970586 1.0
4 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10016 SHELBY BAPTIST MEDICAL CENTER 1000 FIRST STREET NORTH ALABASTER AL 35007 AL - Birmingham 18 31633.27 ... 0.873215 1.054754 1.000000 1.0 0.985152 1.000000 1.0 0.705312 1.000000 1.0

5 rows × 23 columns

In [71]:
# Extracting only the ratio columns
ratio_columns = [col for col in final_df.columns if col.endswith('Ratio')]
X = final_df[ratio_columns]

# Splitting data into train and test sets for X (features only)
X_train, X_test = train_test_split(X, test_size=0.2, random_state=123)

Section 4: K-Nearest Neighbor (KNN) ¶

4.1 KNN ¶

KNN identifies anomalies by measuring the distance from each data point to its nearest neighbors; points that lie far from their neighbors are considered outliers. The approach is non-parametric and easy to implement and understand, but it is computationally intensive for large datasets and sensitive to noise, which can lead to incorrect outlier detection.
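The scoring idea can be illustrated in plain NumPy: with PyOD's default `method='largest'` (visible in `knn.get_params()` below), each point's anomaly score is its distance to its k-th nearest neighbor. This is a sketch on made-up 2-D toy data, not the assignment's model:

```python
import numpy as np

# Toy data: a tight cluster of four points plus one isolated point
X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 0.9], [8.0, 8.0]])

k = 3
# Pairwise Euclidean distance matrix
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
np.fill_diagonal(D, np.inf)            # a point is not its own neighbor
# Anomaly score = distance to the k-th nearest neighbor
scores = np.sort(D, axis=1)[:, k - 1]

print(scores.argmax())  # 4 -- the isolated point scores highest
```

Thresholding these scores at the (1 − contamination) quantile then yields the 0/1 labels, which is what `knn.threshold_` reports for the fitted model below.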

In [72]:
# Function to count the unique values
def count_stat(vector):
    unique, counts = np.unique(vector, return_counts=True)
    return dict(zip(unique, counts))

def descriptive_stat_threshold(df, pred_score, threshold):
    df = pd.DataFrame(df)
    df['Anomaly_Score'] = pred_score
    df['Group'] = np.where(df['Anomaly_Score'] < threshold, 'Normal', 'Outlier')
    cnt = df.groupby('Group')['Anomaly_Score'].count().reset_index().rename(columns={'Anomaly_Score': 'Count'})
    cnt['Count %'] = (cnt['Count'] / cnt['Count'].sum()) * 100
    stat = df.groupby('Group').mean().round(2).reset_index()
    stat = cnt.merge(stat, on='Group')
    return stat
In [73]:
# Build the KNN model
knn = KNN(contamination=0.05)
knn.fit(X_train)

# Predict anomaly scores for training and test data
y_train_scores = knn.decision_function(X_train)
y_train_pred = knn.predict(X_train)

y_test_scores = knn.decision_function(X_test)
y_test_pred = knn.predict(X_test)

# Print the statistics of the predictions for training and test data
print("Training Data Predictions:", count_stat(y_train_pred))
print("Test Data Predictions:", count_stat(y_test_pred))

# Get descriptive statistics for outliers
threshold = knn.threshold_

# Print the threshold for the defined contamination rate
print("Threshold for Contamination Rate:", knn.threshold_, '\n')
Training Data Predictions: {0: 124620, 1: 5832}
Test Data Predictions: {0: 30977, 1: 1636}
Threshold for Contamination Rate: 0.20077006129635047 

In [74]:
knn.get_params()
Out[74]:
{'algorithm': 'auto',
 'contamination': 0.05,
 'leaf_size': 30,
 'method': 'largest',
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': 1,
 'n_neighbors': 5,
 'p': 2,
 'radius': 1.0}
In [75]:
stats_train = descriptive_stat_threshold(pd.DataFrame(X_train, columns=ratio_columns), y_train_scores, threshold)
stats_test = descriptive_stat_threshold(pd.DataFrame(X_test, columns=ratio_columns), y_test_scores, threshold)

# Descriptive statistics for train data
print("Training Data Statistics:")
stats_train
Training Data Statistics:
Out[75]:
Group Count Count % Average_Medicare_byDRG_Ratio Average_Medicare_Bystates_Ratio Average_Medicare_Bystatescity_Ratio Average_Medicare_Byzip_Ratio Average_Total_Bystates_Ratio Average_Total_Bystatescity_Ratio Average_Total_Byzip_Ratio Average_OOP_Bystates_Ratio Average_OOP_Bystatescity_Ratio Average_OOP_Byzip_Ratio Anomaly_Score
0 Normal 124620 95.52939 0.98 0.99 1.00 1.00 0.98 0.99 1.00 0.96 0.99 1.00 0.05
1 Outlier 5832 4.47061 1.34 1.24 1.07 1.01 1.33 1.11 1.02 1.92 1.30 1.04 0.33
In [76]:
# Descriptive statistics for test data
print("Test Data Statistics:")
stats_test
Test Data Statistics:
Out[76]:
Group Count Count % Average_Medicare_byDRG_Ratio Average_Medicare_Bystates_Ratio Average_Medicare_Bystatescity_Ratio Average_Medicare_Byzip_Ratio Average_Total_Bystates_Ratio Average_Total_Bystatescity_Ratio Average_Total_Byzip_Ratio Average_OOP_Bystates_Ratio Average_OOP_Bystatescity_Ratio Average_OOP_Byzip_Ratio Anomaly_Score
0 Normal 30977 94.983595 0.98 0.99 1.00 1.00 0.98 0.99 1.00 0.95 0.99 1.00 0.05
1 Outlier 1636 5.016405 1.32 1.22 1.07 1.01 1.31 1.10 1.02 1.89 1.29 1.04 0.33

4.1.1 KNN Outlier List ¶

In [77]:
# Add a 'Prediction' column to original dataset
final_df['Prediction'] = 0  # Initialize with 0

# Mark outliers in training data
final_df.loc[X_train.index, 'Prediction'] = y_train_pred

# Mark outliers in test data
final_df.loc[X_test.index, 'Prediction'] = y_test_pred
final_df.describe()
Out[77]:
Provider_Zipcode Total_Discharges Average_Covered_Charges Average_Total_Payments Average_Medicare_Payments Average_OOP Average_Medicare_byDRG_Ratio Average_Medicare_Bystates_Ratio Average_Medicare_Bystatescity_Ratio Average_Medicare_Byzip_Ratio Average_Total_Bystates_Ratio Average_Total_Bystatescity_Ratio Average_Total_Byzip_Ratio Average_OOP_Bystates_Ratio Average_OOP_Bystatescity_Ratio Average_OOP_Byzip_Ratio Prediction
count 163065.000000 163065.000000 163065.000000 163065.000000 163065.000000 163065.000000 163065.000000 163065.000000 163065.000000 163065.000000 163065.000000 163065.000000 163065.000000 163065.000000 163065.000000 163065.000000 163065.000000
mean 47938.121908 42.776304 36133.954224 9707.473804 8494.490964 1212.982840 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.999994 0.999994 0.045798
std 27854.323080 51.104042 35065.365931 7664.642598 7309.467261 1148.097584 0.281375 0.215091 0.101956 0.038675 0.195274 0.092549 0.035360 0.586527 0.229179 0.086610 0.209047
min 1040.000000 11.000000 2459.400000 2673.000000 1148.900000 0.000000 0.337408 0.339831 0.261737 0.261737 0.384805 0.238195 0.303604 0.000000 0.000000 0.000000 0.000000
25% 27261.000000 17.000000 15947.160000 5234.500000 4192.350000 776.310000 0.820838 0.868395 1.000000 1.000000 0.880824 1.000000 1.000000 0.721250 1.000000 1.000000 0.000000
50% 44309.000000 27.000000 25245.820000 7214.100000 6158.460000 935.720000 0.917180 0.954250 1.000000 1.000000 0.956830 1.000000 1.000000 0.882159 1.000000 1.000000 0.000000
75% 72901.000000 49.000000 43232.590000 11286.400000 10056.880000 1221.410000 1.090344 1.071483 1.000000 1.000000 1.064688 1.000000 1.000000 1.101927 1.000000 1.000000 0.000000
max 99835.000000 3383.000000 929118.900000 156158.180000 154620.810000 75998.660000 7.672359 5.174415 3.846861 1.782259 7.857934 3.753099 1.897813 38.192611 10.028956 3.545154 1.000000
In [78]:
# Filter out outliers
outliers = final_df[final_df['Prediction'] == 1]

# Display the list of outliers
outliers_list = outliers[['DRG', 'Provider_Id', 'Provider_Name','Provider_State', 'Provider_City', 'Provider_Zipcode'] + ratio_columns + ['Prediction']]
print("List of Outliers:")
outliers_list
List of Outliers:
Out[78]:
DRG Provider_Id Provider_Name Provider_State Provider_City Provider_Zipcode Average_Medicare_byDRG_Ratio Average_Medicare_Bystates_Ratio Average_Medicare_Bystatescity_Ratio Average_Medicare_Byzip_Ratio Average_Total_Bystates_Ratio Average_Total_Bystatescity_Ratio Average_Total_Byzip_Ratio Average_OOP_Bystates_Ratio Average_OOP_Bystatescity_Ratio Average_OOP_Byzip_Ratio Prediction
44 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 30100 CARONDELET HEART AND VASCULAR INSTITUTE AZ TUCSON 85704 0.646680 0.607636 0.731709 1.000000 0.821832 0.942372 1.000000 1.753963 1.665115 1.000000 1
67 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 50017 MERCY GENERAL HOSPITAL CA SACRAMENTO 95819 1.246383 1.006650 0.844865 1.000000 1.295101 1.042120 1.000000 2.610058 1.767791 1.000000 1
88 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 50180 JOHN MUIR MEDICAL CENTER - WALNUT CREEK CAMPUS CA WALNUT CREEK 94598 1.219708 0.985106 1.000000 1.000000 1.473051 1.000000 1.000000 3.697437 1.000000 1.000000 1
104 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 50334 SALINAS VALLEY MEMORIAL HOSPITAL CA SALINAS 93901 1.269125 1.025018 1.000000 1.000000 1.996299 1.000000 1.000000 6.424064 1.000000 1.000000 1
123 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 50599 UNIVERSITY OF CALIFORNIA DAVIS MEDICAL CENTER CA SACRAMENTO 95817 1.756405 1.418573 1.190586 1.000000 1.372732 1.104587 1.000000 1.163755 0.788210 1.000000 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
162869 948 - SIGNS & SYMPTOMS W/O MCC 450508 MEMORIAL HOSPITAL TX NACOGDOCHES 75961 0.809226 0.846520 0.974436 0.974436 0.931781 1.091734 1.091734 1.289130 1.632630 1.632630 1
162934 948 - SIGNS & SYMPTOMS W/O MCC 490032 MEDICAL COLLEGE OF VIRGINIA HOSPITALS VA RICHMOND 23298 1.510141 1.751285 1.538383 1.000000 2.039643 1.661285 1.000000 3.093812 1.990335 1.000000 1
162972 948 - SIGNS & SYMPTOMS W/O MCC 500008 UNIVERSITY OF WASHINGTON MEDICAL CTR WA SEATTLE 98195 1.747121 1.759734 1.445275 1.000000 1.706477 1.452658 1.000000 1.541961 1.479298 1.000000 1
163030 948 - SIGNS & SYMPTOMS W/O MCC 520059 AURORA MEMORIAL HSPTL BURLINGTON WI BURLINGTON 53105 0.676813 0.737801 1.000000 1.000000 1.512762 1.000000 1.000000 4.173365 1.000000 1.000000 1
163050 948 - SIGNS & SYMPTOMS W/O MCC 520138 AURORA ST LUKES MEDICAL CENTER WI MILWAUKEE 53215 1.073500 1.170233 0.937221 1.045942 1.234657 1.046721 1.100769 1.455838 1.544807 1.286938 1

7468 rows × 17 columns

In [79]:
print(f' The Outlier list contains {len(outliers_list.groupby("Provider_Name").count())} providers who performed {len(outliers_list.groupby("DRG").count())} procedures across {len(outliers_list.groupby("Provider_State").count())} states', '\n')
 The Outlier list contains 1107 providers who performed 100 procedures across 51 states 
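As an aside, `len(df.groupby(col).count())` counts distinct values; `Series.nunique()` is an equivalent, more direct idiom. A toy sketch (illustrative data, not the real outlier list):

```python
import pandas as pd

# Toy stand-in for outliers_list, just to show the idiom
outliers_list = pd.DataFrame({
    'Provider_Name': ['A', 'A', 'B', 'C'],
    'DRG': ['d1', 'd2', 'd1', 'd1'],
    'Provider_State': ['CA', 'CA', 'NY', 'NY'],
})

# nunique() counts distinct values directly, no groupby needed
print(outliers_list['Provider_Name'].nunique(),  # 3 providers
      outliers_list['DRG'].nunique(),            # 2 DRGs
      outliers_list['Provider_State'].nunique()) # 2 states
```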

4.1.2 Top 10 Outliers by DRG ¶

In [80]:
outliers_list.groupby('DRG').agg(count = pd.NamedAgg(column = 'DRG', aggfunc = 'count')).sort_values(by= 'count', ascending= False)[1:11]
Out[80]:
count
DRG
329 - MAJOR SMALL & LARGE BOWEL PROCEDURES W MCC 186
207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATOR SUPPORT 96+ HOURS 184
314 - OTHER CIRCULATORY SYSTEM DIAGNOSES W MCC 171
208 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATOR SUPPORT <96 HOURS 165
460 - SPINAL FUSION EXCEPT CERVICAL W/O MCC 160
871 - SEPTICEMIA OR SEVERE SEPSIS W/O MV 96+ HOURS W MCC 156
470 - MAJOR JOINT REPLACEMENT OR REATTACHMENT OF LOWER EXTREMITY W/O MCC 149
252 - OTHER VASCULAR PROCEDURES W MCC 134
870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS 133
682 - RENAL FAILURE W MCC 132

4.1.3 Outliers by State ¶

In [81]:
outliers_list.groupby('Provider_State').agg(count = pd.NamedAgg(column = 'Provider_State', aggfunc = 'count')).sort_values(by= 'count', ascending= False)
Out[81]:
count
Provider_State
CA 1069
NY 937
TX 725
IL 544
FL 411
MD 377
PA 355
MO 168
MN 165
NM 145
MI 143
LA 137
AZ 136
WI 135
IN 129
OH 126
OK 122
GA 119
MA 110
AL 109
VA 106
WA 103
NJ 98
TN 93
AK 78
MS 74
CO 64
CT 63
KY 63
SC 58
HI 57
AR 57
DC 56
KS 51
IA 44
NC 41
OR 38
WV 35
NV 34
NE 26
SD 20
UT 14
RI 13
WY 6
NH 5
ID 4
ME 4
MT 1
DE 0
VT 0
ND 0

Key Takeaways

  • A total of 7,468 instances (~5% of total data) were identified as outliers.
  • Both training and test data show a clear majority of instances classified as normal, with a smaller percentage identified as outliers.
  • The average values of ratio columns (related to Medicare payments, total payments, and out-of-pocket expenses) are higher in outlier groups across both datasets, indicating potential areas of abnormal behavior or higher variability.
  • The anomaly detection model appears consistent between training and test data, with similar patterns in group distribution and anomaly scores.
  • 329 - MAJOR SMALL & LARGE BOWEL PROCEDURES W MCC and the state of CA appear to see the most potentially fraudulent activity.

4.2 Combination by Average ¶

Averaging the normalized anomaly scores of multiple KNN models (with k ranging from 10 to 200) yields a consolidated score that is less sensitive to any single choice of k.

In [82]:
# Averaging multiple KNN models
# Standardize data
X_train_norm, X_test_norm = standardizer(X_train, X_test)

# Test a range of k-neighbors from 10 to 200
k_list = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 
 120, 130, 140, 150, 160, 170, 180, 190, 200]
n_clf = len(k_list)

# Prepare data frames to store the model results
train_scores = np.zeros([X_train.shape[0], n_clf])
test_scores = np.zeros([X_test.shape[0], n_clf])

# Modeling
for i, k in enumerate(k_list):
    clf = KNN(n_neighbors=k, method='largest')
    clf.fit(X_train_norm)
    train_scores[:, i] = clf.decision_scores_
    test_scores[:, i] = clf.decision_function(X_test_norm)

# Decision scores have to be normalized before combination
train_scores_norm, test_scores_norm = standardizer(train_scores, test_scores)

# Combination by average for training and test data
y_train_by_average = average(train_scores_norm)
y_test_by_average = average(test_scores_norm)
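The reason the decision scores must be standardized before averaging is that each KNN model scores on its own scale. A minimal numpy sketch of the same normalize-then-average idea (toy scores, not PyOD's `standardizer`/`average` helpers, though the z-scoring is equivalent):

```python
import numpy as np

# Scores from three hypothetical detectors on five points (columns = detectors).
# The detectors agree on ranking but use wildly different scales.
scores = np.array([[1.0, 10.0, 0.1],
                   [2.0, 20.0, 0.2],
                   [3.0, 30.0, 0.3],
                   [4.0, 40.0, 0.4],
                   [50.0, 500.0, 5.0]])

# Z-score each detector's column so the scales are comparable, then average
z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
combined = z.mean(axis=1)
print(int(np.argmax(combined)))  # → 4: the last point stands out under every detector
```

Averaging the raw columns instead would let the second detector's large scale dominate the combination.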
In [83]:
# Calculate new threshold for Combination by average
def set_threshold_by_contamination_rate(anomaly_scores, desired_contamination_rate):
    threshold = np.percentile(anomaly_scores, 100 * (1 - desired_contamination_rate))
    return threshold

# Example: set the contamination rate to 5% (0.05)
desired_contamination_rate = 0.05
comb_threshold = set_threshold_by_contamination_rate(y_train_by_average, desired_contamination_rate)

# Print the new threshold
print("New Threshold for Contamination Rate:", comb_threshold)
New Threshold for Contamination Rate: 1.6723343404873459
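To sanity-check the threshold function: on synthetic scores 1..100 with a 5% contamination rate, the 95th percentile is chosen and about 5% of points land at or above it:

```python
import numpy as np

scores = np.arange(1, 101, dtype=float)  # synthetic anomaly scores 1..100

def set_threshold_by_contamination_rate(anomaly_scores, desired_contamination_rate):
    return np.percentile(anomaly_scores, 100 * (1 - desired_contamination_rate))

threshold = set_threshold_by_contamination_rate(scores, 0.05)
n_flagged = int((scores >= threshold).sum())
print(threshold, n_flagged)  # 95.05, and 5 of 100 points (5%) are flagged
```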
In [84]:
stats_train_comb = descriptive_stat_threshold(pd.DataFrame(X_train, columns=ratio_columns), y_train_by_average, comb_threshold)
stats_test_comb = descriptive_stat_threshold(pd.DataFrame(X_test, columns=ratio_columns), y_test_by_average, comb_threshold)

# Descriptive statistics for train data
print("Training Data Statistics:")
stats_train_comb
Training Data Statistics:
Out[84]:
Group Count Count % Average_Medicare_byDRG_Ratio Average_Medicare_Bystates_Ratio Average_Medicare_Bystatescity_Ratio Average_Medicare_Byzip_Ratio Average_Total_Bystates_Ratio Average_Total_Bystatescity_Ratio Average_Total_Byzip_Ratio Average_OOP_Bystates_Ratio Average_OOP_Bystatescity_Ratio Average_OOP_Byzip_Ratio Anomaly_Score
0 Normal 123929 94.999693 0.98 0.99 1.00 1.00 0.99 1.00 1.00 0.97 0.99 1.00 -0.16
1 Outlier 6523 5.000307 1.32 1.23 1.07 1.01 1.28 1.08 1.01 1.63 1.19 1.01 3.09
In [85]:
# Descriptive statistics for test data
print("Test Data Statistics:")
stats_test_comb
Test Data Statistics:
Out[85]:
Group Count Count % Average_Medicare_byDRG_Ratio Average_Medicare_Bystates_Ratio Average_Medicare_Bystatescity_Ratio Average_Medicare_Byzip_Ratio Average_Total_Bystates_Ratio Average_Total_Bystatescity_Ratio Average_Total_Byzip_Ratio Average_OOP_Bystates_Ratio Average_OOP_Bystatescity_Ratio Average_OOP_Byzip_Ratio Anomaly_Score
0 Normal 30985 95.008126 0.98 0.99 1.00 1.00 0.98 1.00 1.00 0.97 0.99 1.00 -0.17
1 Outlier 1628 4.991874 1.32 1.23 1.06 1.01 1.28 1.08 1.01 1.63 1.21 1.02 3.08
In [86]:
# Function to plot histograms with threshold annotation and color differentiation
def plot_histogram_with_threshold(scores, title, threshold):
    fig = go.Figure()

    # Split the scores into two parts: below the threshold and above/equal to the threshold
    below_threshold = scores[scores < threshold]
    above_threshold = scores[scores >= threshold]

    # Add histogram traces for the two parts
    fig.add_trace(go.Histogram(x=below_threshold, name='Below Threshold', marker=dict(color='#09b18e')))
    fig.add_trace(go.Histogram(x=above_threshold, name='Above Threshold', marker=dict(color='#8f63e2')))
    
    # Add vertical line annotating the threshold
    fig.add_vline(x=threshold, line_dash="dash", line_color="#0d5087", annotation_text="Threshold",
                  annotation_position="top right", annotation=dict(font=dict(color="#0d5087", size=12)))

    # Update layout
    fig.update_layout(
        title=title,
        xaxis_title="Outlier Score",
        yaxis_title="Frequency",
        showlegend=True,
        height=450,
        width=800,
        bargap=0.01,
        paper_bgcolor='rgba(0,0,0,0)',
        plot_bgcolor='rgba(0,0,0,0)'
    )

    fig.update_xaxes(range=[-1, 5])

    # Show the plot
    fig.show()

# Plotting the combination by average for training data
plot_histogram_with_threshold(np.array(y_train_by_average), "Training Data - Combination by Average", comb_threshold)

# Plotting the combination by average for test data
plot_histogram_with_threshold(np.array(y_test_by_average), "Test Data - Combination by Average", comb_threshold)

4.2.1 Combination by AVG Outlier List ¶

In [87]:
# Add a column 'Prediction_Comb_Avg' to final_df and initialize with 0 (not outlier)
final_df['Prediction_Comb_Avg'] = 0

# Mark outliers in training data
final_df.loc[X_train.index[y_train_by_average >= comb_threshold], 'Prediction_Comb_Avg'] = 1

# Mark outliers in test data
final_df.loc[X_test.index[y_test_by_average >= comb_threshold], 'Prediction_Comb_Avg'] = 1
final_df.head()
Out[87]:
DRG Provider_Id Provider_Name Provider_StreetAddress Provider_City Provider_State Provider_Zipcode Hospital_referral_region_desp Total_Discharges Average_Covered_Charges ... Average_Medicare_Bystatescity_Ratio Average_Medicare_Byzip_Ratio Average_Total_Bystates_Ratio Average_Total_Bystatescity_Ratio Average_Total_Byzip_Ratio Average_OOP_Bystates_Ratio Average_OOP_Bystatescity_Ratio Average_OOP_Byzip_Ratio Prediction Prediction_Comb_Avg
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 91 32963.07 ... 1.038763 1.0 1.005855 1.037810 1.0 0.885921 1.033356 1.0 0 0
1 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10005 MARSHALL MEDICAL CENTER SOUTH 2505 U S HIGHWAY 431 NORTH BOAZ AL 35957 AL - Birmingham 14 15131.85 ... 1.000000 1.0 1.007653 1.000000 1.0 0.708782 1.000000 1.0 0 0
2 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10006 ELIZA COFFEE MEMORIAL HOSPITAL 205 MARENGO STREET FLORENCE AL 35631 AL - Birmingham 24 37560.37 ... 1.000000 1.0 0.946260 1.000000 1.0 0.857643 1.000000 1.0 0 0
3 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10011 ST VINCENT'S EAST 50 MEDICAL PARK EAST DRIVE BIRMINGHAM AL 35235 AL - Birmingham 25 13998.28 ... 0.899200 1.0 0.943232 0.915208 1.0 1.126205 0.970586 1.0 0 0
4 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10016 SHELBY BAPTIST MEDICAL CENTER 1000 FIRST STREET NORTH ALABASTER AL 35007 AL - Birmingham 18 31633.27 ... 1.000000 1.0 0.985152 1.000000 1.0 0.705312 1.000000 1.0 0 0

5 rows × 25 columns

In [88]:
# Filter out outliers
outliers_comb = final_df[final_df['Prediction_Comb_Avg'] == 1]

# Display the list of outliers
outliers_list_comb = outliers_comb[['DRG', 'Provider_Id', 'Provider_Name','Provider_State', 'Provider_City', 'Provider_Zipcode'] + ratio_columns + ['Prediction_Comb_Avg']]
print("List of Outliers:")
outliers_list_comb.head(5)
List of Outliers:
Out[88]:
DRG Provider_Id Provider_Name Provider_State Provider_City Provider_Zipcode Average_Medicare_byDRG_Ratio Average_Medicare_Bystates_Ratio Average_Medicare_Bystatescity_Ratio Average_Medicare_Byzip_Ratio Average_Total_Bystates_Ratio Average_Total_Bystatescity_Ratio Average_Total_Byzip_Ratio Average_OOP_Bystates_Ratio Average_OOP_Bystatescity_Ratio Average_OOP_Byzip_Ratio Prediction_Comb_Avg
104 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 50334 SALINAS VALLEY MEMORIAL HOSPITAL CA SALINAS 93901 1.269125 1.025018 1.000000 1.00000 1.996299 1.000000 1.000000 6.424064 1.000000 1.0000 1
245 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 110010 EMORY UNIVERSITY HOSPITAL GA ATLANTA 30322 1.090453 1.179870 1.216306 1.00000 1.514428 1.408846 1.000000 2.853551 1.908965 1.0000 1
280 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 140008 LOYOLA GOTTLIEB MEMORIAL HOSPITAL IL MELROSE PARK 60160 0.792296 0.817234 1.000000 1.00000 2.497097 1.000000 1.000000 7.042845 1.000000 1.0000 1
294 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 140080 RHC ST FRANCIS HOSPITAL IL EVANSTON 60202 1.248071 1.287354 1.103888 1.00000 2.351565 1.390816 1.000000 5.231343 1.681938 1.0000 1
356 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 150109 FRANCISCAN ST ELIZABETH HEALTH - LAFAYETTE EAST IN LAFAYETTE 47905 0.925007 1.033613 1.170310 1.17031 0.984917 0.838043 0.838043 0.818963 0.377300 0.3773 1
In [89]:
print(f' The Combination by AVG Outlier list contains {len(outliers_list_comb.groupby("Provider_Name").count())} providers who performed {len(outliers_list_comb.groupby("DRG").count())} procedures across {len(outliers_list_comb.groupby("Provider_State").count())} states', '\n')
 The Combination by AVG Outlier list contains 894 providers who performed 100 procedures across 51 states 

4.2.2 Top 10 Outlier by DRG ¶

In [90]:
outliers_list_comb.groupby('DRG').agg(count = pd.NamedAgg(column = 'DRG', aggfunc = 'count')).sort_values(by= 'count', ascending= False)[1:11]
Out[90]:
count
DRG
470 - MAJOR JOINT REPLACEMENT OR REATTACHMENT OF LOWER EXTREMITY W/O MCC 171
291 - HEART FAILURE & SHOCK W MCC 165
208 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATOR SUPPORT <96 HOURS 161
853 - INFECTIOUS & PARASITIC DISEASES W O.R. PROCEDURE W MCC 159
329 - MAJOR SMALL & LARGE BOWEL PROCEDURES W MCC 150
460 - SPINAL FUSION EXCEPT CERVICAL W/O MCC 149
682 - RENAL FAILURE W MCC 144
314 - OTHER CIRCULATORY SYSTEM DIAGNOSES W MCC 143
392 - ESOPHAGITIS, GASTROENT & MISC DIGEST DISORDERS W/O MCC 139
292 - HEART FAILURE & SHOCK W CC 129

4.2.3 Outlier by State ¶

In [91]:
outliers_list_comb.groupby('Provider_State').agg(count = pd.NamedAgg(column = 'Provider_State', aggfunc = 'count')).sort_values(by= 'count', ascending= False)
Out[91]:
count
Provider_State
NY 1190
CA 1092
TX 969
IL 605
FL 449
PA 416
MD 336
MO 293
OK 202
LA 201
NM 201
MS 156
WI 146
MI 134
MN 133
GA 125
IN 121
OH 114
AR 113
KY 104
AL 86
TN 81
VA 80
WA 80
AZ 79
NJ 75
CO 55
WV 52
HI 50
KS 50
DC 49
MA 43
AK 39
OR 36
IA 36
NC 31
SC 29
SD 22
NV 19
ME 15
NE 14
NH 10
UT 8
CT 7
ID 2
MT 1
RI 1
WY 1
VT 0
DE 0
ND 0

Key Takeaways

  • A total of 8,151 instances (~5% of total data) were identified as outliers.
  • The combination by average method provides a consolidated view of anomaly scores across multiple KNN models.
  • Both training and test datasets show a majority of instances classified as normal, with a smaller percentage identified as outliers.
  • The average anomaly scores for the outlier group are consistently higher than those for the normal group, indicating effective identification of anomalies by the combined model.
  • 470 - MAJOR JOINT REPLACEMENT OR REATTACHMENT OF LOWER EXTREMITY W/O MCC and the state of NY appear to see the most potentially fraudulent activity.

Section 5: Principal Component Analysis (PCA) ¶

5.1 PCA¶

PCA reduces the number of features while retaining most of the important information. Fewer dimensions lower the computational cost, and reducing the feature count also helps mitigate overfitting in our models.
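As an illustration of why PCA can detect anomalies, a minimal reconstruction-error sketch on synthetic data: points near the principal subspace reconstruct well, while points far off it get large residuals. (PyOD's `PCA` detector uses a weighted distance across all eigenvectors, so this is a simplification.)

```python
import numpy as np

rng = np.random.default_rng(0)
# Data that lies close to a 2-D plane inside 5-D space
latent = rng.normal(size=(500, 2))
mix = rng.normal(size=(2, 5))
X = latent @ mix + 0.05 * rng.normal(size=(500, 5))
# Plant one point far off the plane, along a direction orthogonal to it
off_plane = np.linalg.svd(mix, full_matrices=True)[2][4]
X[0] += 5.0 * off_plane

# Fit the top-2 principal directions and score by reconstruction error
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
V_k = Vt[:2]                        # principal plane
resid = Xc - Xc @ V_k.T @ V_k       # component outside the plane
recon_errors = (resid ** 2).sum(axis=1)
print(int(np.argmax(recon_errors)))  # → 0: the planted outlier scores highest
```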

In [92]:
pca = PCA(contamination=0.05) 
pca.fit(X_train)

# Get the prediction labels and outlier scores of the training data
y_train_pred_pca = pca.predict(X_train)  # binary labels (0: inliers, 1: outliers)
y_train_scores_pca = pca.decision_function(X_train)  # raw outlier scores

y_test_scores_pca = pca.decision_function(X_test) 
y_test_pred_pca = pca.predict(X_test)
In [93]:
pca.get_params()
Out[93]:
{'contamination': 0.05,
 'copy': True,
 'iterated_power': 'auto',
 'n_components': None,
 'n_selected_components': None,
 'random_state': None,
 'standardization': True,
 'svd_solver': 'auto',
 'tol': 0.0,
 'weighted': True,
 'whiten': False}
In [94]:
[pca.explained_variance_,
pca.explained_variance_ratio_]
Out[94]:
[array([3.85181529, 1.9101038 , 1.66358007, 0.91376819, 0.85387575,
        0.48865034, 0.233276  , 0.04869219, 0.02381848, 0.01249654]),
 array([0.38517858, 0.19100892, 0.16635673, 0.09137612, 0.08538692,
        0.04886466, 0.02332742, 0.00486918, 0.00238183, 0.00124964])]
In [95]:
pca_threshold = pca.threshold_
print("The threshold for the defined contamination rate:" , pca.threshold_)
The threshold for the defined contamination rate: 10158.968465695625
In [96]:
stats_pca_train=descriptive_stat_threshold(pd.DataFrame(X_train, columns=ratio_columns), y_train_scores_pca, pca_threshold)
stats_pca_test=descriptive_stat_threshold(pd.DataFrame(X_test, columns=ratio_columns), y_test_scores_pca, pca_threshold)

# Descriptive statistics for train data
print("Training Data Statistics:")
stats_pca_train
Training Data Statistics:
Out[96]:
Group Count Count % Average_Medicare_byDRG_Ratio Average_Medicare_Bystates_Ratio Average_Medicare_Bystatescity_Ratio Average_Medicare_Byzip_Ratio Average_Total_Bystates_Ratio Average_Total_Bystatescity_Ratio Average_Total_Byzip_Ratio Average_OOP_Bystates_Ratio Average_OOP_Bystatescity_Ratio Average_OOP_Byzip_Ratio Anomaly_Score
0 Normal 123929 94.999693 0.98 0.98 0.99 1.00 0.98 0.99 1.00 0.97 0.99 1.0 3362.86
1 Outlier 6523 5.000307 1.40 1.33 1.13 1.01 1.36 1.14 1.01 1.60 1.16 1.0 14465.32
In [97]:
# Descriptive statistics for test data
print("Test Data Statistics:")
stats_pca_test
Test Data Statistics:
Out[97]:
Group Count Count % Average_Medicare_byDRG_Ratio Average_Medicare_Bystates_Ratio Average_Medicare_Bystatescity_Ratio Average_Medicare_Byzip_Ratio Average_Total_Bystates_Ratio Average_Total_Bystatescity_Ratio Average_Total_Byzip_Ratio Average_OOP_Bystates_Ratio Average_OOP_Bystatescity_Ratio Average_OOP_Byzip_Ratio Anomaly_Score
0 Normal 31028 95.139975 0.98 0.98 0.99 1.00 0.98 0.99 1.00 0.97 0.99 1.00 3364.83
1 Outlier 1585 4.860025 1.39 1.32 1.12 1.01 1.35 1.13 1.01 1.62 1.18 1.01 14402.64
In [98]:
# Function to plot histograms with threshold annotation and color differentiation
def plot_histogram_with_threshold(scores, title, threshold):
    fig = go.Figure()

    # Split the scores into two parts: below the threshold and above/equal to the threshold
    below_threshold = scores[scores < threshold]
    above_threshold = scores[scores >= threshold]

    # Add histogram traces for the two parts
    fig.add_trace(go.Histogram(x=below_threshold, name='Below Threshold', marker=dict(color='#09b18e')))
    fig.add_trace(go.Histogram(x=above_threshold, name='Above Threshold', marker=dict(color='#8f63e2')))
    
    # Add vertical line annotating the threshold
    fig.add_vline(x=threshold, line_dash="dash", line_color="#0d5087", annotation_text="Threshold",
                  annotation_position="top right", annotation=dict(font=dict(color="#0d5087", size=12)))

    # Update layout
    fig.update_layout(
        title=title,
        xaxis_title="Outlier Score",
        yaxis_title="Frequency",
        showlegend=True,
        height=500,
        width=800,
        paper_bgcolor='rgba(0,0,0,0)',
        plot_bgcolor='rgba(0,0,0,0)'
    )

    # Show the plot
    fig.show()

# Plotting the combination by average for training data
plot_histogram_with_threshold(y_train_scores_pca, "PCA - Training Data", pca_threshold)

# Plotting the combination by average for test data
plot_histogram_with_threshold(y_test_scores_pca, "PCA - Test Data", pca_threshold)

5.1.1 PCA Outlier List¶

In [99]:
# Add a column 'Prediction_PCA' to final_df and initialize with 0 (not outlier)
final_df['Prediction_PCA'] = 0

# Mark outliers in training data
final_df.loc[X_train.index, 'Prediction_PCA'] = y_train_pred_pca

# Mark outliers in test data
final_df.loc[X_test.index, 'Prediction_PCA'] = y_test_pred_pca
final_df.head()
Out[99]:
DRG Provider_Id Provider_Name Provider_StreetAddress Provider_City Provider_State Provider_Zipcode Hospital_referral_region_desp Total_Discharges Average_Covered_Charges ... Average_Medicare_Byzip_Ratio Average_Total_Bystates_Ratio Average_Total_Bystatescity_Ratio Average_Total_Byzip_Ratio Average_OOP_Bystates_Ratio Average_OOP_Bystatescity_Ratio Average_OOP_Byzip_Ratio Prediction Prediction_Comb_Avg Prediction_PCA
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 91 32963.07 ... 1.0 1.005855 1.037810 1.0 0.885921 1.033356 1.0 0 0 0
1 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10005 MARSHALL MEDICAL CENTER SOUTH 2505 U S HIGHWAY 431 NORTH BOAZ AL 35957 AL - Birmingham 14 15131.85 ... 1.0 1.007653 1.000000 1.0 0.708782 1.000000 1.0 0 0 0
2 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10006 ELIZA COFFEE MEMORIAL HOSPITAL 205 MARENGO STREET FLORENCE AL 35631 AL - Birmingham 24 37560.37 ... 1.0 0.946260 1.000000 1.0 0.857643 1.000000 1.0 0 0 0
3 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10011 ST VINCENT'S EAST 50 MEDICAL PARK EAST DRIVE BIRMINGHAM AL 35235 AL - Birmingham 25 13998.28 ... 1.0 0.943232 0.915208 1.0 1.126205 0.970586 1.0 0 0 0
4 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10016 SHELBY BAPTIST MEDICAL CENTER 1000 FIRST STREET NORTH ALABASTER AL 35007 AL - Birmingham 18 31633.27 ... 1.0 0.985152 1.000000 1.0 0.705312 1.000000 1.0 0 0 0

5 rows × 26 columns

In [100]:
# Filter out outliers
outliers_pca = final_df[final_df['Prediction_PCA'] == 1]

# Display the list of outliers
outliers_list_pca = outliers_pca[['DRG', 'Provider_Id', 'Provider_Name','Provider_State', 'Provider_City', 'Provider_Zipcode'] + ratio_columns + ['Prediction_PCA']]
print("List of Outliers:")
outliers_list_pca.head(5)
List of Outliers:
Out[100]:
DRG Provider_Id Provider_Name Provider_State Provider_City Provider_Zipcode Average_Medicare_byDRG_Ratio Average_Medicare_Bystates_Ratio Average_Medicare_Bystatescity_Ratio Average_Medicare_Byzip_Ratio Average_Total_Bystates_Ratio Average_Total_Bystatescity_Ratio Average_Total_Byzip_Ratio Average_OOP_Bystates_Ratio Average_OOP_Bystatescity_Ratio Average_OOP_Byzip_Ratio Prediction_PCA
104 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 50334 SALINAS VALLEY MEMORIAL HOSPITAL CA SALINAS 93901 1.269125 1.025018 1.000000 1.0 1.996299 1.000000 1.0 6.424064 1.000000 1.0 1
245 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 110010 EMORY UNIVERSITY HOSPITAL GA ATLANTA 30322 1.090453 1.179870 1.216306 1.0 1.514428 1.408846 1.0 2.853551 1.908965 1.0 1
280 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 140008 LOYOLA GOTTLIEB MEMORIAL HOSPITAL IL MELROSE PARK 60160 0.792296 0.817234 1.000000 1.0 2.497097 1.000000 1.0 7.042845 1.000000 1.0 1
294 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 140080 RHC ST FRANCIS HOSPITAL IL EVANSTON 60202 1.248071 1.287354 1.103888 1.0 2.351565 1.390816 1.0 5.231343 1.681938 1.0 1
347 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 150056 INDIANA UNIVERSITY HEALTH IN INDIANAPOLIS 46206 1.269749 1.418832 1.359352 1.0 1.518408 1.450774 1.0 1.857759 1.758615 1.0 1
In [101]:
print(f' The PCA Outlier list contains {len(outliers_list_pca.groupby("Provider_Name").count())} providers who performed {len(outliers_list_pca.groupby("DRG").count())} procedures across {len(outliers_list_pca.groupby("Provider_State").count())} states', '\n')
 The PCA Outlier list contains 895 providers who performed 100 procedures across 51 states 

5.1.2 Top 10 Outliers by DRG¶

In [102]:
outliers_list_pca.groupby('DRG').agg(count = pd.NamedAgg(column = 'DRG', aggfunc = 'count')).sort_values(by= 'count', ascending= False)[1:11]
Out[102]:
count
DRG
470 - MAJOR JOINT REPLACEMENT OR REATTACHMENT OF LOWER EXTREMITY W/O MCC 158
291 - HEART FAILURE & SHOCK W MCC 152
208 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATOR SUPPORT <96 HOURS 147
853 - INFECTIOUS & PARASITIC DISEASES W O.R. PROCEDURE W MCC 143
329 - MAJOR SMALL & LARGE BOWEL PROCEDURES W MCC 142
682 - RENAL FAILURE W MCC 135
392 - ESOPHAGITIS, GASTROENT & MISC DIGEST DISORDERS W/O MCC 134
292 - HEART FAILURE & SHOCK W CC 132
314 - OTHER CIRCULATORY SYSTEM DIAGNOSES W MCC 132
194 - SIMPLE PNEUMONIA & PLEURISY W CC 131

5.1.3 Outliers by State¶

In [103]:
outliers_list_pca.groupby('Provider_State').agg(count = pd.NamedAgg(column = 'Provider_State', aggfunc = 'count')).sort_values(by= 'count', ascending= False)
Out[103]:
count
Provider_State
NY 1073
CA 1054
TX 923
IL 501
FL 429
PA 378
MO 317
OK 287
MD 258
NM 181
MS 178
IN 175
VA 163
LA 157
OH 146
MN 141
GA 134
AR 130
KY 126
AZ 124
TN 108
SC 101
WI 97
NJ 89
WA 88
MI 82
AL 78
WV 77
CO 70
DC 56
OR 49
MA 45
IA 42
HI 38
NC 37
AK 32
NV 24
NE 23
KS 23
SD 22
ME 20
UT 13
NH 9
CT 7
ID 2
RI 1
ND 0
MT 0
VT 0
DE 0
WY 0

Key Takeaways

  • A total of 8,108 instances (~5% of total data; 6,523 in training and 1,585 in test) were identified as outliers.
  • The average anomaly score for outliers (14,465.32) is much higher compared to the normal group (3,362.86), indicating that outliers exhibit more extreme deviations in their feature values.
  • 470 - MAJOR JOINT REPLACEMENT OR REATTACHMENT OF LOWER EXTREMITY W/O MCC and the state of NY again appear to see the most potentially fraudulent activity.

Section 6: Model Comparison¶

In [104]:
########
# HBOS #
########
from pyod.models.hbos import HBOS
n_bins = 50
hbos = HBOS(n_bins=n_bins,contamination=0.05) 
hbos.fit(X_train)
y_train_hbos_scores = hbos.decision_function(X_train)
y_test_hbos_scores = hbos.decision_function(X_test)

########
# ECOD #
########
from pyod.models.ecod import ECOD
clf_name = 'ECOD'
ecod = ECOD(contamination=0.05) 
ecod.fit(X_train)
y_test_ecod_scores = ecod.decision_function(X_test)

########
# PCA  #
########
from pyod.models.pca import PCA
pca = PCA(contamination=0.05) 
pca.fit(X_train)
y_test_pca_scores = pca.decision_function(X_test)

########
# KNN ##
########
knn = KNN(contamination=0.05)
knn.fit(X_train)
y_test_knn_scores = knn.decision_function(X_test)

# Thresholds
[ecod.threshold_, hbos.threshold_, pca.threshold_, knn.threshold_]
Out[104]:
[26.6950577459844, 3.340918013562263, 10158.968465695625, 0.20077006129635047]
In [105]:
# Collect the outlier scores of all four models on the test data
Actual_preds = pd.DataFrame({'HBO_Score': y_test_hbos_scores, 
                             'ECOD_Score': y_test_ecod_scores, 
                             'PCA_Score': y_test_pca_scores, 
                             'KNN_Score': y_test_knn_scores})
Actual_preds['HBOS_pred'] = np.where(Actual_preds['HBO_Score']>hbos.threshold_,1,0)
Actual_preds['ECOD_pred'] = np.where(Actual_preds['ECOD_Score']>ecod.threshold_,1,0)
Actual_preds['PCA_pred'] = np.where(Actual_preds['PCA_Score']>pca.threshold_,1,0)
Actual_preds['KNN_pred'] = np.where(Actual_preds['KNN_Score']>knn.threshold_,1,0)
Actual_preds.head()
Out[105]:
HBO_Score ECOD_Score PCA_Score KNN_Score HBOS_pred ECOD_pred PCA_pred KNN_pred
0 -26.765132 7.122095 2450.436977 0.014515 0 0 0 0
1 -26.765132 8.352012 2660.476899 0.026798 0 0 0 0
2 -24.073842 6.490236 2021.072391 0.018293 0 0 0 0
3 -27.563327 5.262983 2137.255685 0.011209 0 0 0 0
4 -25.121238 5.724757 2047.587991 0.025255 0 0 0 0
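Before cross-tabulating, it can help to quantify how much any two detectors agree. A toy sketch of overall agreement and Jaccard overlap on the flagged set (illustrative data; the column names mirror `Actual_preds`):

```python
import pandas as pd

# Toy stand-in for two detectors' binary predictions
preds = pd.DataFrame({
    'HBOS_pred': [0, 0, 1, 1, 0, 1],
    'PCA_pred':  [0, 1, 1, 0, 0, 1],
})

# Fraction of instances on which the two detectors agree (both 0 or both 1)
agreement = (preds['HBOS_pred'] == preds['PCA_pred']).mean()

# Jaccard overlap of the flagged sets: |both flag| / |either flags|
both = ((preds['HBOS_pred'] == 1) & (preds['PCA_pred'] == 1)).sum()
either = ((preds['HBOS_pred'] == 1) | (preds['PCA_pred'] == 1)).sum()
jaccard = both / either
print(agreement, jaccard)  # → 0.666..., 0.5
```

Applied to the real HBOS vs PCA crosstab, overall agreement is (30364 + 900) / 32613 ≈ 0.96, even though the two models disagree on roughly 40% of the instances either of them flags.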

HBOS vs PCA¶

In [106]:
pd.crosstab(Actual_preds['HBOS_pred'],Actual_preds['PCA_pred'])
Out[106]:
PCA_pred 0 1
HBOS_pred
0 30364 685
1 664 900
In [107]:
# Cross-tabulate the HBOS and PCA predictions
cm = pd.crosstab(Actual_preds['HBOS_pred'],Actual_preds['PCA_pred'])

# Plot confusion matrix as heatmap
trace2 = go.Heatmap(z=cm, 
                        x=['PCA: 0', 'PCA: 1'], 
                        y=['HBOS: 0', 'HBOS: 1'], 
                        showscale=False, 
                        colorscale=[
                            [0.0, "#09b18e"],  
                            [0.0015, "#0d5087"],  
                            [1.0, "#ffffff"], 
                        ],
                        xgap=20,
                        ygap=20,
                        text=cm,
                        texttemplate="%{text}")

# Define layout
layout = go.Layout(
    title=dict(text="Confusion Matrix between HBOS and PCA", x=0.5, y=0.9, xanchor='center', yanchor='top'),
    xaxis=dict(title='PCA Pred', showticklabels=True),
    yaxis=dict(title='HBOS Pred', showticklabels=True),
    autosize=False,
    width=500,
    height=500,
    plot_bgcolor='rgba(0,0,0,0)',
    paper_bgcolor='rgba(0,0,0,0)'
)

# Plot heatmap
fig = go.Figure(data=[trace2], layout=layout)
fig.show()

ECOD vs PCA¶

In [108]:
pd.crosstab(Actual_preds['ECOD_pred'],Actual_preds['PCA_pred'])
Out[108]:
PCA_pred 0 1
ECOD_pred
0 30552 447
1 476 1138
In [109]:
# Cross-tabulate the ECOD and PCA predictions
cm = pd.crosstab(Actual_preds['ECOD_pred'],Actual_preds['PCA_pred'])

# Plot confusion matrix as heatmap
trace2 = go.Heatmap(z=cm, 
                        x=['PCA: 0', 'PCA: 1'], 
                        y=['ECOD: 0', 'ECOD: 1'], 
                        showscale=False, 
                        colorscale=[
                            [0.0, "#09b18e"],  
                            [0.0015, "#0d5087"],  
                            [1.0, "#ffffff"], 
                        ],
                        xgap=20,
                        ygap=20,
                        text=cm,
                        texttemplate="%{text}")

# Define layout
layout = go.Layout(
    title=dict(text="Confusion Matrix between ECOD and PCA", x=0.5, y=0.9, xanchor='center', yanchor='top'),
    xaxis=dict(title='PCA Pred', showticklabels=True),
    yaxis=dict(title='ECOD Pred', showticklabels=True),
    autosize=False,
    width=500,
    height=500,
    plot_bgcolor='rgba(0,0,0,0)',
    paper_bgcolor='rgba(0,0,0,0)'
)

# Plot heatmap
fig = go.Figure(data=[trace2], layout=layout)
fig.show()

KNN vs PCA¶

In [110]:
pd.crosstab(Actual_preds['KNN_pred'],Actual_preds['PCA_pred'])
Out[110]:
PCA_pred 0 1
KNN_pred
0 30393 584
1 635 1001
In [111]:
# Cross-tabulate the KNN and PCA predictions
cm = pd.crosstab(Actual_preds['KNN_pred'],Actual_preds['PCA_pred'])

# Plot confusion matrix as heatmap
trace2 = go.Heatmap(z=cm, 
                        x=['PCA: 0', 'PCA: 1'], 
                        y=['KNN: 0', 'KNN: 1'], 
                        showscale=False, 
                        colorscale=[
                            [0.0, "#09b18e"],  
                            [0.017, "#0d5087"],  
                            [1.0, "#ffffff"], 
                        ],
                        xgap=20,
                        ygap=20,
                        text=cm,
                        texttemplate="%{text}")

# Define layout
layout = go.Layout(
    title=dict(text="Confusion Matrix between KNN and PCA", x=0.5, y=0.9, xanchor='center', yanchor='top'),
    xaxis=dict(title='PCA Pred', showticklabels=True),
    yaxis=dict(title='KNN Pred', showticklabels=True),
    autosize=False,
    width=500,
    height=500,
    plot_bgcolor='rgba(0,0,0,0)',
    paper_bgcolor='rgba(0,0,0,0)'
)

# Plot heatmap
fig = go.Figure(data=[trace2], layout=layout)
fig.show()

Key Takeaways

  • KNN and PCA produce very similar results, with only minimal differences between their predictions.
  • All models classified more than 30k test instances as normal and roughly 900-1,140 as outliers, with some disagreement over exactly which instances are flagged.
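One natural next step, sketched here on toy data (the `*_pred` column names mirror `Actual_preds`; the values are hypothetical), is to combine the four detectors by majority vote so that only instances flagged by most models are treated as outliers:

```python
import pandas as pd

# Toy binary predictions from the four detectors
preds = pd.DataFrame({
    'HBOS_pred': [1, 0, 1, 0],
    'ECOD_pred': [1, 0, 0, 0],
    'PCA_pred':  [0, 0, 1, 1],
    'KNN_pred':  [1, 1, 1, 0],
})

# Flag a record only when at least 3 of the 4 detectors call it an outlier
preds['Majority_pred'] = (preds.sum(axis=1) >= 3).astype(int)
print(preds['Majority_pred'].tolist())  # → [1, 0, 1, 0]
```

This trades recall for precision: a record flagged by three or four independent detectors is a stronger candidate for fraud review than one flagged by a single model.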